Compressed Data Structures for Dynamic Sequences


J. Ian Munro    Yakov Nekrich


Abstract 

We consider the problem of storing a dynamic string $S$ over an alphabet $\Sigma = \{1, \ldots, \sigma\}$ in
compressed form. Our representation supports insertions and deletions of symbols and answers
three fundamental queries: $\mathrm{access}(i, S)$ returns the $i$-th symbol in $S$, $\mathrm{rank}_a(i, S)$ counts how
many times a symbol $a$ occurs among the first $i$ positions in $S$, and $\mathrm{select}_a(i, S)$ finds the position
where a symbol $a$ occurs for the $i$-th time. We present the first fully-dynamic data structure
for arbitrarily large alphabets that achieves optimal query times for all three operations and
supports updates with worst-case time guarantees. Ours is also the first fully-dynamic data
structure that needs only $nH_k + o(n \log \sigma)$ bits, where $H_k$ is the $k$-th order entropy and $n$ is
the string length. Moreover, our representation supports extraction of a substring $S[i..i+\ell]$ in
optimal $O(\log n/\log\log n + \ell/\log_\sigma n)$ time.


1 Introduction 

In this paper we consider the problem of storing a sequence $S$ of length $n$ over an alphabet
$\Sigma = \{1, \ldots, \sigma\}$ so that the following operations are supported:

- $\mathrm{access}(i, S)$ returns the $i$-th symbol, $S[i]$, in $S$

- $\mathrm{rank}_a(i, S)$ counts how many times $a$ occurs among the first $i$ symbols in $S$,
$\mathrm{rank}_a(i, S) = |\{\, j \mid S[j] = a \text{ and } 1 \le j \le i \,\}|$

- $\mathrm{select}_a(i, S)$ finds the position in $S$ where $a$ occurs for the $i$-th time, $\mathrm{select}_a(i, S) = j$ where $j$ is
such that $S[j] = a$ and $\mathrm{rank}_a(j, S) = i$.
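The three queries can be pinned down with a deliberately naive reference implementation. The following Python sketch is not part of the paper: the function names are illustrative, positions are 1-based as in the definitions above, and every query takes linear time. It only fixes the semantics that the later data structures must reproduce.

```python
def access(i, S):
    """Return S[i], the i-th symbol of S (1-based)."""
    return S[i - 1]


def rank(a, i, S):
    """Count how many times a occurs among the first i symbols of S."""
    return sum(1 for x in S[:i] if x == a)


def select(a, i, S):
    """Return the 1-based position of the i-th occurrence of a in S."""
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == a:
            seen += 1
            if seen == i:
                return pos
    raise ValueError("S contains fewer than i occurrences of a")
```

By definition, `rank(a, select(a, i, S), S) == i` whenever the `i`-th occurrence of `a` exists.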

This problem, also known as the rank-select problem, is one of the most fundamental problems in
compressed data structures. There are many data structures that store a string in compressed form
and support the three operations defined above efficiently. There are static data structures that use
$nH_0 + o(n \log \sigma)$ bits or even $nH_k + o(n \log \sigma)$ bits for any $k \le \alpha \log_\sigma n - 1$ and a positive constant
$\alpha < 1$.$^1$ Efficient static rank-select data structures are described in [10, 12, 25, 26, 3, 20, 35, 6].
We refer to [6] for the most recent results and a discussion of previous static solutions.

In many situations we must work with dynamic sequences. We must be able to insert a new 
symbol at an arbitrary position i in the sequence or delete an arbitrary symbol S[i]. The design 
of dynamic solutions, that support insertions and deletions of symbols, is an important problem. 
1 Henceforth $H_0(S) = \sum_{a \in \Sigma} \frac{n_a}{n} \log \frac{n}{n_a}$, where $n_a$ is the number of times $a$ occurs in $S$, is the $0$-th order entropy
and $H_k(S)$ for $k \ge 0$ is the $k$-th order empirical entropy. $H_k(S)$ can be defined as $H_k(S) = \frac{1}{n} \sum_{A \in \Sigma^k} |S_A| H_0(S_A)$,
where $S_A$ is the subsequence of $S$ generated by symbols that follow the $k$-tuple $A$; $H_k(S)$ is the lower bound on the
average space usage of any statistical compression method that encodes each symbol using the context of $k$ previous
symbols [25].
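The entropy definitions in the footnote can be evaluated directly on small inputs. The sketch below is an illustration only: the helper names `h0` and `hk` are assumptions, logarithms are base 2, and the first $k$ symbols (which have no full context) are simply skipped when the subsequences $S_A$ are collected.

```python
from collections import Counter, defaultdict
from math import log2


def h0(S):
    """Zero-order empirical entropy of S, in bits per symbol."""
    n = len(S)
    if n == 0:
        return 0.0
    return sum((c / n) * log2(n / c) for c in Counter(S).values())


def hk(S, k):
    """k-th order empirical entropy: (1/n) * sum over contexts A of |S_A| * H_0(S_A)."""
    if k == 0:
        return h0(S)
    n = len(S)
    if n == 0:
        return 0.0
    following = defaultdict(list)  # context A -> symbols that follow A in S
    for j in range(k, n):
        following[tuple(S[j - k:j])].append(S[j])
    return sum(len(s_a) * h0(s_a) for s_a in following.values()) / n
```

For the string `"abab"`, `h0` gives 1 bit per symbol, while `hk` with `k = 1` gives 0 because every symbol is determined by its predecessor.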
New    $nH_k + o(n \log \sigma)$    $O(\lambda)$    $O(\lambda)$    $O(\lambda)$    $O(\lambda)$
Table 1: Previous and New Results for Fully-Dynamic Sequences. The rightmost column indicates
whether updates are amortized (A) or worst-case (W). We use notation $\lambda = \log n/\log\log n$ in this
table.


Fully-dynamic data structures for the rank-select problem were considered in [21, 10, 8, 9, 19, 28, 22].

Recently Navarro and Nekrich [33, 34] obtained a fully-dynamic solution with $O(\log n/\log\log n)$
times for rank, access, and select operations. By the lower bound of Fredman and Saks, these
query times are optimal. The data structure described in [33] uses $nH_0(S) + o(n \log \sigma)$ bits and
supports updates in $O(\log n/\log\log n)$ amortized time. It is also possible to support updates in
$O(\log n)$ worst-case time, but then the time for answering a rank query grows to $O(\log n)$ [33]. All
previously known fully-dynamic data structures need at least $nH_0(S) + o(n \log \sigma)$ bits. The only two
exceptions are the data structures of Jansson et al. [23] and Grossi et al. [17] that keep $S$ in $nH_k(S) +
o(n \log \sigma)$ bits, but do not support rank and select queries. A more restrictive dynamic scenario was
considered by Grossi et al. [17] and Jansson et al. [23]: an update operation replaces a symbol $S[i]$
with another symbol so that the total length of $S$ does not change, but insertions of new symbols
or deletions of symbols of $S$ are not supported. Their data structures need $nH_k(S) + o(n \log \sigma)$ bits
and answer access queries in $O(1)$ time; the data structure of Grossi et al. [17] also supports rank
and select queries in $O(\log n/\log\log n)$ time.

In this paper we describe the first fully-dynamic data structure that keeps the input sequence
in $nH_k(S) + o(n \log \sigma)$ bits; our representation supports rank, select, and access queries in optimal
$O(\log n/\log\log n)$ time. Symbol insertions and deletions at any position in $S$ are supported in
$O(\log n/\log\log n)$ worst-case time. We list our and previous results for fully-dynamic sequences
in Table 1. Our representation of dynamic sequences also supports the operation of extracting
a substring. Previous dynamic data structures require $O(\ell)$ calls of the access operation in order to
extract a substring of length $\ell$. Thus the previous best fully-dynamic representation, described
in [33], needs $O(\ell(\log n/\log\log n))$ time to extract a substring $S[i..i+\ell-1]$ of $S$. Data structures
described in [17] and [23] support substring extraction in $O(\log n/\log\log n + \ell/\log_\sigma n)$ time but
they either do not support rank and select queries or they support only updates that replace a
symbol with another symbol. Our dynamic data structure can extract a substring in optimal
$O(\log n/\log\log n + \ell/\log_\sigma n)$ time without any restrictions on updates or queries.

In Section 2 we describe a data structure that uses $O(\log n)$ bits per symbol and supports rank,
select, and access in optimal $O(\log n/\log\log n)$ time. This data structure essentially maintains
a linked list $L$ containing all symbols of $S$; using some auxiliary data structures on $L$, we can
answer rank, select, and access queries on $S$. In Section 3 we show how the space usage can
be reduced to $O(\log \sigma)$ bits per symbol. A compressed data structure that needs $H_0(S)$ bits
per symbol is presented in Section 4. The approach of Section 4 is based on dividing $S$ into a
number of subsequences. We store a fully-dynamic data structure for only one such subsequence
of appropriately small size. Updates on other subsequences are supported by periodic re-building.
In Section 5 we show that the space usage can be reduced to $nH_k(S) + o(n \log \sigma)$.

2 $O(n \log n)$-Bit Data Structure

We start by describing a data structure that uses $O(\log n)$ bits per symbol.

Lemma 1 A dynamic string $S[1, m]$ for $m \le n$ over alphabet $\Sigma = \{1, \ldots, \sigma\}$ can be stored in
a data structure that needs $O(m \log m)$ bits, and answers queries access, rank and select in time
$O(\log m/\log\log n)$. Insertions and deletions of symbols are supported in $O(\log m/\log\log n)$ time.
The data structure uses a universal look-up table of size $o(n^\varepsilon)$ for an arbitrarily small $\varepsilon > 0$.

Proof: We keep elements of $S$ in a list $L$. Each entry of $L$ contains a symbol $a \in \Sigma$. For every
$a \in \Sigma$, we also maintain the list $L_a$. Entries of $L_a$ correspond to those entries of $L$ that contain
the symbol $a$. We maintain data structures $D(L)$ and $D(L_a)$ that enable us to find the number of
entries in $L$ (or in some list $L_a$) that precede an entry $e \in L$ (resp. $e \in L_a$); we can also find the
$i$-th entry $e$ in $L_a$ or $L$ using $D(\cdot)$. We will prove in Lemma 6 that $D(L)$ needs $O(m \log m)$ bits
and supports queries and updates on $L$ in $O(\log m/\log\log n)$ time.

We can answer a query $\mathrm{select}_a(i, S)$ by finding the $i$-th entry $e$ in $L_a$, following the pointer
from $e$ to the corresponding entry $e' \in L$, and counting the number $v$ of entries preceding $e'$ in
$L$. Clearly$^2$, $\mathrm{select}_a(i, S) = v$. To answer a query $\mathrm{rank}_a(i, S)$, we first find the $i$-th entry $e$ in $L$.
Then we find the last entry $e_a$ that precedes $e$ and contains $a$. Such queries can be answered in
$O((\log\log \sigma)^2 \log\log m)$ time as will be shown in Lemma 5 in Section A.1. If $e'_a$ is the entry that
corresponds to $e_a$ in $L_a$, then $\mathrm{rank}_a(i, S) = v$, where $v$ is the number of entries that precede $e'_a$ in
$L_a$. □
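The reduction in this proof can be made concrete with a toy model. The Python sketch below is an assumption-laden illustration, not the structure of Lemma 1: the class and method names are hypothetical, the counting structures $D(L)$ and $D(L_a)$ and the colored predecessor search are replaced by linear scans, and an entry is counted as preceding itself (as in the footnote on list entries). It only shows how rank and select reduce to "count entries before $e$" and "last entry with symbol $a$ before $e$".

```python
class Entry:
    """One list entry; identity (not value) links L and L_a."""
    __slots__ = ("symbol",)

    def __init__(self, symbol):
        self.symbol = symbol


class ListSequence:
    def __init__(self):
        self.L = []    # all entries, in sequence order
        self.La = {}   # symbol a -> entries containing a, in sequence order

    @staticmethod
    def _count_upto(lst, entry):
        # Stand-in for D(.): entries of lst up to and including `entry`.
        for k, e in enumerate(lst, start=1):
            if e is entry:
                return k
        raise ValueError("entry not in list")

    def _colored_pred(self, i, a):
        # Stand-in for the colored predecessor query: last a among L[1..i].
        for e in reversed(self.L[:i]):
            if e.symbol == a:
                return e
        return None

    def insert(self, i, a):
        """Insert a so that it becomes S[i] (1-based)."""
        e = Entry(a)
        pred = self._colored_pred(i - 1, a)
        lst = self.La.setdefault(a, [])
        pos = 0 if pred is None else self._count_upto(lst, pred)
        lst.insert(pos, e)
        self.L.insert(i - 1, e)

    def delete(self, i):
        e = self.L.pop(i - 1)
        lst = self.La[e.symbol]
        del lst[self._count_upto(lst, e) - 1]

    def access(self, i):
        return self.L[i - 1].symbol

    def rank(self, a, i):
        pred = self._colored_pred(i, a)
        return 0 if pred is None else self._count_upto(self.La[a], pred)

    def select(self, a, i):
        return self._count_upto(self.L, self.La[a][i - 1])
```

Replacing the two scans by structures with $O(\log m/\log\log n)$ counting and fast colored predecessor search is exactly what the lemma requires; the interface stays the same.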

3 $O(n \log \sigma)$-Bit Data Structure

Lemma 2 A dynamic string $S[1, n]$ over alphabet $\Sigma = \{1, \ldots, \sigma\}$ can be stored in a data structure
using $O(n \log \sigma)$ bits, and supporting queries access, rank and select in time $O(\log n/\log\log n)$.
Insertions and deletions of symbols are supported in $O(\log n/\log\log n)$ time.

Proof: If $\sigma = \log^{O(1)} n$, then the data structures described in [35] and [20] provide the desired query
and update times. The case $\sigma = \log^{\omega(1)} n$ is considered below.

We show how the problem on a sequence of size $n$ can be reduced to the same problem on a
sequence of size $O(\sigma \log n)$. The sequence $S$ is divided into chunks. We can maintain the size $n_i$ of
each chunk $C_i$, so that $n_i = O(\sigma \log n)$ and the total number of chunks is bounded by $O(n/\sigma)$. We
will show how to maintain chunks in Section A.3. For each $a \in \Sigma$, we keep a global bit sequence
$B_a$. $B_a = 1^{d_1} 0 1^{d_2} 0 \ldots 1^{d_i} 0 \ldots$ where $d_i$ is the number of times $a$ occurs in the chunk $C_i$. We
also keep a bit sequence $B_t = 1^{n_1} 0 1^{n_2} 0 \ldots 1^{n_i} 0 \ldots$. We can compute $\mathrm{rank}_a(i, S) = v_1 + v_2$ where
$v_1 = \mathrm{rank}_1(\mathrm{select}_0(j_1, B_a), B_a)$, $j_1 = \mathrm{rank}_0(\mathrm{select}_1(i, B_t), B_t)$, $v_2 = \mathrm{rank}_a(i_1, C_{i_2})$, $i_2 = j_1 + 1$ and
$i_1 = i - \mathrm{rank}_1(\mathrm{select}_0(j_1, B_t), B_t)$. To answer a query $\mathrm{select}_a(i, S)$, we first find the index $i_2$ of
the chunk $C_{i_2}$ that contains the $i$-th occurrence of $a$, $i_2 = \mathrm{rank}_0(\mathrm{select}_1(i, B_a), B_a) + 1$. Then we
find $v_a = \mathrm{select}_a(i - i_1, C_{i_2})$ for $i_1 = \mathrm{rank}_1(\mathrm{select}_0(i_2 - 1, B_a), B_a)$; $v_a$ identifies the position of
the $(i - i_1)$-th occurrence of $a$ in the chunk $C_{i_2}$, where $i_1$ denotes the number of $a$'s in the first
$i_2 - 1$ chunks. Finally we compute $\mathrm{select}_a(i, S) = v_a + s_p$ where $s_p = \mathrm{rank}_1(\mathrm{select}_0(i_2 - 1, B_t), B_t)$
is the total number of symbols in the first $i_2 - 1$ chunks. We can support queries and updates on
$B_t$ and on each $B_a$ in $O(\log n/\log\log n)$ time [35]. By Lemma 1, queries and updates on $C_i$ are
supported in $O(\log \sigma/\log\log n)$ time. Hence, the query and update times of our data structure are
$O(\log n/\log\log n)$.

2 To simplify the description, we assume that a list entry precedes itself.

$B_t$ can be kept in $O((n/\sigma) \log \sigma)$ bits [33]. The array $B_a$ uses $O(n_a \log \frac{n}{n_a})$ bits, where $n_a$ is the
number of times $a$ occurs in $S$. Hence, all $B_a$ and $B_t$ use $O((n/\sigma) \log \sigma + \sum_a n_a \log \frac{n}{n_a}) = O(n \log \sigma)$
bits. By Lemma 1 we can also keep the data structure for each chunk in $O(\log \sigma + \log\log n) =
O(\log \sigma)$ bits per symbol. □
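The rank and select formulas of this proof can be checked on a small static example. The following Python sketch is illustrative only: the class name, the fixed chunk size, and the plain Python lists standing in for the bit sequences $B_t$ and $B_a$ are assumptions, and updates are omitted. Bit-level rank and select are linear scans, with $\mathrm{select}(0) = 0$ by convention.

```python
def rank_bit(b, i, B):
    """Number of bits equal to b among B[1..i]."""
    return sum(1 for x in B[:i] if x == b)


def select_bit(b, k, B):
    """1-based position of the k-th bit equal to b in B; 0 if k == 0."""
    if k == 0:
        return 0
    seen = 0
    for pos, x in enumerate(B, start=1):
        if x == b:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k such bits")


def unary(counts):
    """Encode counts c_1, c_2, ... as 1^{c_1} 0 1^{c_2} 0 ..."""
    bits = []
    for c in counts:
        bits.extend([1] * c)
        bits.append(0)
    return bits


class ChunkedSequence:
    def __init__(self, S, chunk_size):
        self.chunks = [S[k:k + chunk_size] for k in range(0, len(S), chunk_size)]
        self.Bt = unary(len(c) for c in self.chunks)
        self.Ba = {a: unary(c.count(a) for c in self.chunks) for a in set(S)}

    def rank(self, a, i):
        if a not in self.Ba or i == 0:
            return 0
        j1 = rank_bit(0, select_bit(1, i, self.Bt), self.Bt)     # full chunks before position i
        v1 = rank_bit(1, select_bit(0, j1, self.Ba[a]), self.Ba[a])
        i1 = i - rank_bit(1, select_bit(0, j1, self.Bt), self.Bt)  # offset inside chunk i2 = j1 + 1
        v2 = self.chunks[j1][:i1].count(a)                        # rank_a(i1, C_{i2})
        return v1 + v2

    def select(self, a, i):
        Ba = self.Ba[a]
        i2 = rank_bit(0, select_bit(1, i, Ba), Ba) + 1            # chunk holding the i-th a
        i1 = rank_bit(1, select_bit(0, i2 - 1, Ba), Ba)           # a's in the first i2 - 1 chunks
        positions = [p for p, x in enumerate(self.chunks[i2 - 1], start=1) if x == a]
        va = positions[i - i1 - 1]                                # select_a(i - i1, C_{i2})
        sp = rank_bit(1, select_bit(0, i2 - 1, self.Bt), self.Bt)
        return va + sp
```

In the actual structure each chunk is handled by Lemma 1 and each bit sequence by a dynamic bit vector; the arithmetic that combines them is the part shown here.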

4 Compressed Data Structure 

In this section we describe a data structure that uses $H_0(S)$ bits per symbol. We start by
considering the case when the alphabet size is not too large, $\sigma \le n/\log^3 n$. The sequence $S$ is split
into subsequences $S_0, S_1, \ldots, S_r$ for $r = O(\log n/\log\log n)$. The subsequence $S_0$ is stored in
$O(\log \sigma)$ bits per element as described in Lemma 2. Subsequences $S_1, \ldots, S_r$ are substrings of $S \setminus S_0$.
$S_1, \ldots, S_r$ are stored in compressed static data structures. New elements are always inserted into
the subsequence $S_0$. Deletions from $S_i$, $i \ge 1$, are implemented as lazy deletions: an element in
$S_i$ is marked as deleted. We guarantee that the number of elements that are marked as deleted is
bounded by $O(n/r)$. If a subsequence $S_i$ contains many elements marked as deleted, it is re-built:
we create a new instance of $S_i$ that does not contain deleted symbols. If the subsequence $S_0$
contains too many elements, we insert the elements of $S_0$ into $S_i$ and re-build $S_i$ for $i \ge 1$. Processes
of constructing a new subsequence and re-building a subsequence with too many obsolete elements
are run in the background.

The bit sequence $M$ identifies elements in $S$ that are marked as deleted: $M[j] = 0$ if and only
if $S[j]$ is marked as deleted. The bit sequence $R$ distinguishes between the elements of $S_0$ and
elements of $S_i$, $i \ge 1$: $R[j] = 0$ if the $j$-th element of $S$ is kept in $S_0$ and $R[j] = 1$ otherwise.

We further need auxiliary data structures for answering select queries. We start by defining
an auxiliary subsequence $\tilde{S}$ that contains copies of elements already stored in other subsequences.
Consider a subsequence $\bar{S}$ obtained by merging subsequences $S_1, \ldots, S_r$ (in other words, $\bar{S}$ is
obtained from $S$ by removing elements of $S_0$). Let $S'_a$ be the subsequence obtained by selecting
(roughly) every $r$-th occurrence of a symbol $a$ in $\bar{S}$. The subsequence $S'$ is obtained by merging
subsequences $S'_a$ for all $a \in \Sigma$. Finally $\tilde{S}$ is obtained by merging $S'$ and $S_0$. We support queries
$\mathrm{select}'_a(i, \tilde{S})$ on $\tilde{S}$, defined as follows: $\mathrm{select}'_a(i, \tilde{S}) = j$ such that (i) a copy of $S[j]$ is stored in $\tilde{S}$ and
(ii) if $\mathrm{select}_a(i, S) = j_1$, then $j \le j_1$ and copies of elements $S[j+1], S[j+2], \ldots, S[j_1]$ are not stored
in $\tilde{S}$. That is, $\mathrm{select}'_a(i, \tilde{S})$ returns the largest index $j$, such that $S[j]$ precedes $S[\mathrm{select}_a(i, S)]$ and
$S[j]$ is also stored in $\tilde{S}$. The data structure for $\tilde{S}$ delivers approximate answers for select queries;
we will show later how the answer to a query $\mathrm{select}_a(i, S)$ can be found quickly if the answer to
$\mathrm{select}'_a(i, \tilde{S})$ is known. Queries $\mathrm{select}'_a(i, \tilde{S})$ can be implemented using standard operations on a bit
sequence of size $O((n/r) \log\log n)$ bits; for completeness, we provide a description in Section A.8.
We remark that $\bar{S}$ and $S'$ are introduced to define $\tilde{S}$; these two subsequences are not stored in our
data structure. The bit sequence $E$ indicates what symbols of $S$ are also stored in $\tilde{S}$: $E[i] = 1$ if
a copy of $S[i]$ is stored in $\tilde{S}$ and $E[i] = 0$ otherwise. The bit sequence $B$ indicates what symbols
in $\tilde{S}$ are actually from $S_0$: $B[i] = 0$ iff $\tilde{S}[i]$ is stored in the subsequence $S_0$. Besides, we keep bit
sequences $D_a$ for each $a \in \Sigma$. Bits of $D_a$ correspond to occurrences of $a$ in $S$. If the $\ell$-th occurrence
of $a$ in $S$ is marked as deleted, then $D_a[\ell] = 0$. All other bits in $D_a$ are set to 1.

| Name | Purpose | Alph. Size | Dynamic/Static |
|---|---|---|---|
| $S_0$ | Subsequence of $S$ | - | Dynamic |
| $S_i$, $1 \le i \le r$ | Subsequence of $S$ | - | Static |
| $M$ | Positions of symbols in $S_i$, $i \ge 1$, that are marked as deleted | const | Dynamic |
| $R$ | Positions of symbols from $S_0$ in $S$ | const | Dynamic |
| $\tilde{S}$ | Delivers an approximate answer to select queries | - | Dynamic |
| $S'_a$, $a \in \Sigma$ | Auxiliary sequences for $\tilde{S}$ | - | Dynamic |
| $E$ | Positions of symbols from $\tilde{S}$ in $S$ | const | Dynamic |
| $B$ | Positions of symbols from $S_0$ in $\tilde{S}$ | const | Dynamic |
| $D_a$ | Positions of symbols marked as deleted among all $a$'s | const | Dynamic |

Table 2: Auxiliary subsequences for answering rank and select queries. A subsequence is dynamic
if both insertions and deletions are supported. If a subsequence is static, then updates are not
supported. Static subsequences are re-built when they contain too many obsolete elements.

We provide the list of subsequences in Table 2. Each subsequence is augmented with a data
structure that supports rank and select queries. For simplicity we will not distinguish between a
subsequence and a data structure on its elements. If a subsequence supports updates, then either
(i) this is a subsequence over a small alphabet or (ii) this subsequence contains a small number of
elements. In case (i), the subsequence is over an alphabet of constant size; by [35, 20] queries on such
subsequences are answered in $O(\log n/\log\log n)$ time. In case (ii) the subsequence contains $O(n/r)$
elements; data structures on such subsequences are implemented as in Lemma 2. All auxiliary
subsequences, except for $\tilde{S}$, are of type (i). Subsequence $S_0$ and the auxiliary subsequence $\tilde{S}$ are of
type (ii). Subsequences $S_i$ for $i \ge 1$ are static, i.e., they are stored in data structures that do not
support updates. We re-build these subsequences when they contain too many obsolete elements.
Thus dynamic subsequences support rank, select, access, and updates in $O(\log n/\log\log n)$ time.
It is known that we can implement all basic operations on a static sequence in $O(\log n/\log\log n)$
time$^3$. Our data structures on static subsequences are based on the approach of Barbay et al. [5];
however, our data structure can be constructed faster when the alphabet size is small and supports
a substring extraction operation. A full description will be given in Section A.7. We will show
below that queries on $S$ are answered by $O(1)$ queries on dynamic subsequences and $O(1)$ queries
on static subsequences.

We also maintain arrays $Size[]$ and $Count_a[]$ for every $a \in \Sigma$. For any $1 \le i \le r$, $Size[i]$ is the
number of symbols in $S_i$ and $Count_a[i]$ specifies how many times $a$ occurs in $S_i$. We keep a data
structure that computes the sum of the first $i \le r$ entries in $Size[]$ and finds the largest $j$ such
that $\sum_{t=1}^{j} Size[t] \le q$ for any integer $q$. The same kinds of queries are also supported on $Count_a[]$.
Arrays $Size[]$ and $Count_a[]$ use $O(\sigma \cdot r \cdot \log n) = O(n/\log n)$ bits.
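The two operations required on $Size[]$ and $Count_a[]$ are prefix sums and a search over prefix sums. A minimal static sketch in Python, assuming non-negative entries and using the hypothetical class name `PrefixSums`, is shown below; the paper's structure additionally has to be updated when subsequences are re-built, which this sketch does not attempt.

```python
from bisect import bisect_right
from itertools import accumulate


class PrefixSums:
    def __init__(self, values):
        # prefix[j] = values[1] + ... + values[j] (1-based), prefix[0] = 0
        self.prefix = [0] + list(accumulate(values))

    def sum_first(self, i):
        """Sum of the first i entries."""
        return self.prefix[i]

    def search(self, q):
        """Largest j such that the sum of the first j entries is <= q (q >= 0)."""
        return bisect_right(self.prefix, q) - 1
```

For example, with sizes `[3, 0, 5, 2]` the call `search(3)` returns 2, because the first two entries sum to 3 and adding the third exceeds it.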

3 Static data structures also achieve significantly faster query times, but this is not necessary for our
implementation.
Queries. To answer a query $\mathrm{rank}_a(i, S)$, we start by computing $i' = \mathrm{select}_1(i, M)$; $i'$ is the
position of the $i$-th element that is not marked as deleted. Then we find $i_0 = \mathrm{rank}_0(i', R)$ and
$i_1 = \mathrm{rank}_1(i', R)$. By definition of $R$, $i_0$ is the number of elements of $S[1..i]$ that are stored in the
subsequence $S_0$. The number of $a$'s in $S_0[1..i_0]$ is computed as $c_1 = \mathrm{rank}_a(i_0, S_0)$. The number
of $a$'s in $S_1, \ldots, S_r$ before the position $i'$ is found as follows. We identify the index $t$, such that
$\sum_{j=1}^{t} Size[j] < i_1 \le \sum_{j=1}^{t+1} Size[j]$. Then we compute how many times $a$ occurred in $S_1, \ldots, S_t$,
$c_{2,1} = \sum_{j=1}^{t} Count_a[j]$, and in the relevant prefix of $S_{t+1}$, $c_{2,2} = \mathrm{rank}_a(i_1 - \sum_{j=1}^{t} Size[j], S_{t+1})$.
Let $c_2 = \mathrm{rank}_1(c_{2,1} + c_{2,2}, D_a)$. Thus $c_2$ is the number of symbols $a$ that are not marked as deleted
among the first $c_{2,1} + c_{2,2}$ occurrences of $a$ in $S \setminus S_0$. Hence $\mathrm{rank}_a(i, S) = c_1 + c_2$.
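The rank procedure can be traced on a toy instance. The Python sketch below is a hedged illustration, not the paper's implementation: all function names are hypothetical, every sequence is a plain list with linear-time rank and select, and it assumes that $D_a$ indexes the occurrences of $a$ in $S \setminus S_0$ (as the sentence on $c_2$ suggests) and that elements of $S_0$ are never marked as deleted.

```python
from bisect import bisect_left
from itertools import accumulate


def rank_seq(x, i, seq):
    """Occurrences of x among the first i items of seq."""
    return sum(1 for y in seq[:i] if y == x)


def select_seq(x, k, seq):
    """1-based position of the k-th occurrence of x in seq."""
    seen = 0
    for pos, y in enumerate(seq, start=1):
        if y == x:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def build_toy(items):
    """items: (symbol, home, deleted) triples in sequence order.
    home 0 means S_0, home i >= 1 means S_i; static homes must be non-decreasing."""
    S0 = [s for s, h, _ in items if h == 0]
    r = max((h for _, h, _ in items), default=0)
    statics = [[s for s, h, _ in items if h == i] for i in range(1, r + 1)]
    M = [0 if d else 1 for _, _, d in items]        # 0 = marked as deleted
    R = [0 if h == 0 else 1 for _, h, _ in items]   # 0 = stored in S_0
    D = {}
    for s, h, d in items:
        if h != 0:                                  # assumption: D_a covers S \ S_0 only
            D.setdefault(s, []).append(0 if d else 1)
    return S0, statics, M, R, D


def rank_query(a, i, S0, statics, M, R, D):
    if i == 0:
        return 0
    i_prime = select_seq(1, i, M)          # position of the i-th live element
    i0 = rank_seq(0, i_prime, R)           # elements from S_0 up to i'
    i1 = rank_seq(1, i_prime, R)           # elements from S_1..S_r up to i'
    c1 = rank_seq(a, i0, S0)
    if i1 == 0:
        return c1
    size_prefix = [0] + list(accumulate(len(s) for s in statics))
    t = bisect_left(size_prefix, i1) - 1   # size_prefix[t] < i1 <= size_prefix[t + 1]
    c21 = sum(s.count(a) for s in statics[:t])
    c22 = rank_seq(a, i1 - size_prefix[t], statics[t])   # statics[t] is S_{t+1}
    c2 = rank_seq(1, c21 + c22, D.get(a, []))
    return c1 + c2
```

The result can be compared against a brute-force count over the live (non-deleted) symbols, which is how the sketch was checked.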

To answer a query $\mathrm{select}_a(i, S)$, we first obtain an approximate answer by asking a query
$\mathrm{select}'_a(i, \tilde{S})$. Let $i' = \mathrm{select}_1(i, D_a)$ be the rank of the $i$-th symbol $a$ that is not marked as deleted.
Let $l_0 = \mathrm{select}'_a(i', \tilde{S})$. We find $l_1 = \mathrm{rank}_1(l_0, E)$ and $l_2 = \mathrm{select}_a(\mathrm{rank}_a(l_1, \tilde{S}) + 1, \tilde{S})$. Let $first =
\mathrm{select}_1(l_1, E)$ and $last = \mathrm{select}_1(l_2, E)$ be the positions of $\tilde{S}[l_1]$ and $\tilde{S}[l_2]$ in $S$. By definition of $\mathrm{select}'$,
$\mathrm{rank}_a(first, S) \le i$ and $\mathrm{rank}_a(last, S) \ge i$. If $\mathrm{rank}_a(first, S) = i$, then obviously $\mathrm{select}_a(i, S) =
first$. Otherwise the answer to $\mathrm{select}_a(i, S)$ is an integer between $first$ and $last$. By definition of
$\tilde{S}$, the substring $S[first], S[first+1], \ldots, S[last]$ contains at most $r$ occurrences of $a$. All these
occurrences are stored in subsequences $S_j$ for $j \ge 1$. We compute $i_0 = \mathrm{rank}_a(\mathrm{rank}_0(first, R), S_0)$
and $i_1 = i' - i_0$. We find the index $t$ such that $\sum_{j=1}^{t-1} Count_a[j] < i_1 \le \sum_{j=1}^{t} Count_a[j]$. Then
$v_1 = \mathrm{select}_a(i_1 - \sum_{j=1}^{t-1} Count_a[j], S_t)$ is the position of $S[\mathrm{select}_a(i, S)]$ in $S_t$. We find its index in $S$
by computing $v_2 = v_1 + \sum_{j=1}^{t-1} Size[j]$ and $v_3 = \mathrm{select}_1(v_2, R)$. Finally $\mathrm{select}_a(i, S) = \mathrm{rank}_1(v_3, M)$.

Answering an access query is straightforward. We determine whether $S[i]$ is stored in $S_0$ or
in some $S_j$ for $j \ge 1$ using $R$. Let $i' = \mathrm{select}_1(i, M)$. If $R[i'] = 0$, then $S[i]$ is stored in $S_0$
and $S[i] = S_0[\mathrm{rank}_0(i', R)]$. If $R[i'] = 1$, we compute $i_1 = \mathrm{rank}_1(i', R)$ and find the index $j$
such that $\sum_{t=1}^{j-1} Size[t] < i_1 \le \sum_{t=1}^{j} Size[t]$. The answer to $\mathrm{access}(i, S)$ is $S[i] = S_j[i_2]$ for $i_2 =
i_1 - \sum_{t=1}^{j-1} Size[t]$.
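The access procedure translates almost line by line into code. The following Python sketch is illustrative: the function names are hypothetical, $M$, $R$, $S_0$ and the static subsequences are plain lists, and rank and select on them are linear scans rather than the structures used in the paper.

```python
from bisect import bisect_left
from itertools import accumulate


def rank_seq(x, i, seq):
    """Occurrences of x among the first i items of seq."""
    return sum(1 for y in seq[:i] if y == x)


def select_seq(x, k, seq):
    """1-based position of the k-th occurrence of x in seq."""
    seen = 0
    for pos, y in enumerate(seq, start=1):
        if y == x:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def access_query(i, S0, statics, M, R):
    i_prime = select_seq(1, i, M)                 # position of the i-th live element
    if R[i_prime - 1] == 0:                       # stored in S_0
        return S0[rank_seq(0, i_prime, R) - 1]
    i1 = rank_seq(1, i_prime, R)                  # index among elements of S_1..S_r
    size_prefix = [0] + list(accumulate(len(s) for s in statics))
    j = bisect_left(size_prefix, i1)              # size_prefix[j-1] < i1 <= size_prefix[j]
    i2 = i1 - size_prefix[j - 1]
    return statics[j - 1][i2 - 1]                 # S_j[i2]
```

In the usage below, the bit sequences describe ten stored elements of which two are marked as deleted, so the live sequence has length eight.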

Space Usage. The redundancy of our data structure can be estimated as follows. The space
needed to keep the symbols that are marked as deleted in subsequences $S_j$ is bounded by $O((n/r)(\log \sigma +
\log r))$: Let $\tilde{n}_a$ denote the number of symbols $a$ that are marked as deleted and let $\tilde{n} = \sum_a \tilde{n}_a$.
Then all symbols that are marked as deleted use $X = \sum_a \tilde{n}_a \log \frac{n}{n_a}$ bits. Since $\frac{n}{n_a} \le \frac{n}{\tilde{n}_a}$,
$X \le \sum_a \tilde{n}_a \log \frac{\tilde{n}}{\tilde{n}_a} + \tilde{n} \log \frac{n}{\tilde{n}}$. If $\tilde{n} < n/r^2$, $X = o(n)$. If $\tilde{n} \ge n/r^2$, then $X = O(n/r) + O(\tilde{n} \log r) +
\sum_a \tilde{n}_a \log \frac{\tilde{n}}{\tilde{n}_a} = O(\frac{n}{r}(\log \sigma + \log r))$. $S_0$ also takes $O((n/r) \log \sigma)$ bits. The bit sequences $R$ and $M$
need $O((n/r) \log r) = o(n)$ bits; $B$, $E$ also use $O((n/r) \log r)$ bits. Each bit sequence $D_a$ can be
maintained in $O(n'_a \log (n_a/n'_a))$ bits where $n_a$ is the total number of symbols $a$ in $S$ and $n'_a$ is the
number of symbols $a$ that are marked as deleted. All $D_a$ take $O(\sum_{a \in \Sigma} n'_a \log \frac{n_a}{n'_a})$ bits. To estimate the
last expression, we divide the alphabet $\Sigma$ into $\Sigma_1$ and $\Sigma_2$; $\Sigma_1$ contains all symbols $a$ such that
$n'_a > n_a/\log^2 n$ and $\Sigma_2$ contains all symbols $a$ such that $n'_a \le n_a/\log^2 n$. Then $\sum_{a \in \Sigma} n'_a \log \frac{n_a}{n'_a} =
\sum_{a \in \Sigma_1} n'_a \log \frac{n_a}{n'_a} + \sum_{a \in \Sigma_2} n'_a \log \frac{n_a}{n'_a} \le (2n/r) \log\log n + (n/\log n) = O((n/r) \log\log n)$. Hence all
$D_a$ need $O((n/r) \log\log n) = o(n)$ bits. The subsequence $\tilde{S}$ can be stored in $O((n/r) \log \sigma)$ bits.
Thus all auxiliary subsequences use $O((n/r)(\log \sigma + \log r)) = O(n \frac{\log \sigma \log\log n}{\log n})$ bits. Data structures
for subsequences $S_i$, $r \ge i \ge 1$, use $\sum_{i=1}^{r} (n_i H_k(S_i) + o(n_i \log \sigma)) = nH_k(S \setminus S_0) + o(n \log \sigma)$ bits
for any $k = o(\log_\sigma n)$, where $n_i$ is the number of symbols in $S_i$. Since $H_k(S) \le H_0(S)$ for $k \ge 0$,
all subsequences $S_i$ are stored in $nH_0(S) + o(n \log \sigma)$ bits.
Updates. When a new symbol is inserted, we insert it into the subsequence $S_0$ and update the
sequence $R$. The data structure for $\tilde{S}$ is also updated accordingly. We also insert a 1-bit at the
appropriate position of bit sequences $M$ and $D_a$ where $a$ is the inserted symbol. Deletions from
$S_0$ are symmetric. When an element is deleted from $S_i$, $i \ge 1$, we replace the 1-bit corresponding
to this element in $M$ with a 0-bit. We also change the appropriate bit in $D_a$ to 0, where $a$ is the
symbol that was deleted from $S_i$.

We must guarantee that the number of elements in $S_0$ is bounded by $O(n/r)$; the number of
elements marked as deleted must also be bounded by $O(n/r)$. Hence we must re-build the data
structure when the number of symbols in $S_0$ or the number of deleted symbols is too big. Since
we aim for updates with worst-case bounds, the cost of re-building is distributed among $O(n/r)$
updates. We run two processes in the background. The first background process moves elements of
$S_0$ into subsequences $S_i$. The second process purges sequences $S_1, \ldots, S_r$ and removes all symbols
marked as deleted from these sequences. Details are given in Section A.4.

We assumed in the description of updates that logn is fixed. In the general case we need 
additional background processes that increase or decrease sizes of subsequences when n becomes 
too large or too small. These processes are organized in a standard way. Thus we obtain the
following result.

Lemma 3 A dynamic string $S[1, n]$ over alphabet $\Sigma = \{1, \ldots, \sigma\}$ for $\sigma \le n/\log^3 n$ can be stored in
a data structure that uses $nH_0 + O(n \frac{\log \sigma \log\log n}{\log n}) + O(n(\log\log \sigma)^3)$ bits and answers queries access,
rank and select in time $O(\log n/\log\log n)$. Insertions and deletions of symbols are supported in
$O(\log n/\log\log n)$ time.

4.1 Compressed Data Structure for $\sigma > n/\log^3 n$

If the alphabet size $\sigma$ is almost linear, we cannot afford storing the arrays $Count_a[]$. Instead,
we keep a bit sequence $BCount_a$ for each alphabet symbol $a$. Let $s_{a,i}$ denote the number of occurrences
of $a$ in the subsequence $S_i$ and $s_a = \sum_{i=1}^{r} s_{a,i}$. Then $BCount_a = 1^{s_{a,1}} 0 1^{s_{a,2}} 0 \ldots 1^{s_{a,r}}$. If
$s_a < r \log^2 n$, we can keep $BCount_a$ in $O(s_a \log \frac{r + s_a}{s_a}) = O(s_a \log\log n)$ bits. If $s_a \ge r \log^2 n$, we can
keep $BCount_a$ in $O(r \log \frac{r + s_a}{r}) = O((s_a/\log^2 n) \log n) = O(s_a/\log n)$ bits. Using $BCount_a$, we can
find for any $q$ the subsequence $S_j$, such that $Count_a[j] < q \le Count_a[j+1]$ in $O(\log n/\log\log n)$
time.
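The unary encoding $BCount_a$ answers "which subsequence holds the $q$-th occurrence of $a$" with one select and one rank on bits. The Python sketch below is illustrative only: the function names are hypothetical and the bit sequence is an uncompressed list scanned linearly, whereas the paper stores it in compressed form with $O(\log n/\log\log n)$-time operations.

```python
def bcount(counts):
    """Encode s_{a,1}, ..., s_{a,r} as 1^{s_{a,1}} 0 1^{s_{a,2}} 0 ... 1^{s_{a,r}}."""
    bits = []
    for k, c in enumerate(counts):
        bits.extend([1] * c)
        if k < len(counts) - 1:
            bits.append(0)
    return bits


def locate(q, bits):
    """Index j (1-based) of the subsequence S_j that holds the q-th occurrence:
    rank_0(select_1(q)) + 1."""
    ones = zeros = 0
    for b in bits:
        if b == 1:
            ones += 1
            if ones == q:
                return zeros + 1
        else:
            zeros += 1
    raise ValueError("fewer than q occurrences")


def occurrences_before(j, bits):
    """Number of occurrences in S_1, ..., S_{j-1}: rank_1(select_0(j - 1))."""
    ones = zeros = 0
    for b in bits:
        if zeros == j - 1:
            break
        if b == 1:
            ones += 1
        else:
            zeros += 1
    return ones
```

With counts `[2, 0, 3, 1]`, the third occurrence lies in the third subsequence, and two occurrences precede that subsequence.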

We also keep an effective alphabet$^4$ for each $S_j$. We keep a bit vector $Map_j[]$ of size $\sigma$, such
that $Map_j[a] = 1$ if and only if $a$ occurs in $S_j$. Using $Map_j[]$, we can map a symbol $a \in [1, n]$ to
a symbol $map_j(a) = \mathrm{rank}_1(a, Map_j)$ so that $map_j(a) \in [1, |S_j|]$ for any $a$ that occurs in $S_j$. Let
$\Sigma_j = \{\, map_j(a) \mid a \text{ occurs in } S_j \,\}$. For every $map_j(a)$ we can find the corresponding symbol $a$ using
a select query on $Map_j$. We keep a static data structure for each sequence $S_j$ over $\Sigma_j$. Queries and
updates are supported in the same way as in Lemma 3. Combining the result of this sub-section
and Lemma 3, we obtain the data structure for an arbitrary alphabet size.
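The effective-alphabet mapping is a rank in one direction and a select in the other. A minimal Python sketch follows; the class name `EffectiveAlphabet` and its methods are hypothetical, symbols are integers in $[1, \sigma]$, and the bit vector is an uncompressed list with linear-time operations.

```python
class EffectiveAlphabet:
    def __init__(self, Sj, sigma):
        # map_bits[a] = 1 iff symbol a occurs in S_j; index 0 is unused.
        self.map_bits = [0] * (sigma + 1)
        for a in Sj:
            self.map_bits[a] = 1

    def to_effective(self, a):
        """map_j(a) = rank_1(a, Map_j), defined only for symbols that occur in S_j."""
        if not self.map_bits[a]:
            raise KeyError("symbol does not occur in S_j")
        return sum(self.map_bits[1:a + 1])

    def to_original(self, e):
        """Inverse mapping: select_1(e, Map_j)."""
        seen = 0
        for a in range(1, len(self.map_bits)):
            if self.map_bits[a]:
                seen += 1
                if seen == e:
                    return a
        raise KeyError("effective symbol out of range")
```

For instance, if $S_j$ contains only the symbols 3, 7 and 10, they are renamed 1, 2 and 3, so the static structure for $S_j$ works over an alphabet of size 3 regardless of $\sigma$.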

Theorem 1 A dynamic string $S[1, n]$ over alphabet $\Sigma = \{1, \ldots, \sigma\}$ can be stored in a data
structure that uses $nH_0 + O(n \frac{\log \sigma \log\log n}{\log n}) + O(n(\log\log \sigma)^3)$ bits and answers queries access,
rank and select in time $O(\log n/\log\log n)$. Insertions and deletions of symbols are supported in
$O(\log n/\log\log n)$ time.

4 An alphabet for $S_j$ is effective if it contains only symbols that actually occurred in $S_j$.
5 Compressed Data Structure II


By slightly modifying the data structure of Theorem 1 we can reduce the space usage to essentially
$H_k(S)$ bits per symbol for any $k = o(\log_\sigma n)$ simultaneously. First, we observe that any subsequence
$S_i$ for $i \ge 1$ is kept in a data structure that consumes $|S_i| H_k(S_i) + o(|S_i| \log \sigma)$ bits of space. Thus all $S_i$
use $\sum_{i=1}^{r} (n_i H_k(S_i) + o(n_i \log \sigma)) = nH_k(S \setminus S_0) + o(n \log \sigma)$ bits. It can be shown that $nH_k(S \setminus S_0) =
nH_k(S) + O(n \frac{\log n}{r})$ bits; we prove this bound in Section A.6. Since $r = O(\log n/\log\log n)$,
the data structure of Theorem 1 uses $nH_k + O(n \frac{\log \sigma \log\log n}{\log n}) + O(n \log\log n) + O(n(\log\log \sigma)^3)$ bits.

In order to get rid of the $O(n \log\log n)$ additive term, we use a different static data structure;
our static data structure is described in Section A.7. As before, the data structure for a sequence
$S_i$ uses $|S_i| H_k + o(|S_i| \log \sigma)$ bits. But we also show in Section A.7 that our static data structure
can be constructed in $O(|S_i|/\log^{1/6} n)$ time if the alphabet size $\sigma$ is sufficiently small, $\sigma \le 2^{\log^{1/3} n}$.
The space usage $nH_k(S) + o(n \log \sigma)$ can be achieved by an appropriate change of the parameter $r$.
If $\sigma > 2^{\log^{1/3} n}$, we use the data structure of Theorem 1. As explained above, the space usage is
$nH_k + o(n \log \sigma) + O(n \log\log n) = nH_k + o(n \log \sigma)$. If $\sigma \le 2^{\log^{1/3} n}$ we also use the data structure of
Theorem 1, but we set $r = O(\log n \log\log n)$ and implement static data structures as in Section A.7.
The data structure needs $nH_k(S) + O(n/\log\log n) + O(n(\log\log \sigma)^3) = nH_k(S) + o(n \log \sigma)$ bits.
Since we can re-build a static data structure for a sequence $S_i$ in $O(|S_i|/\log^{1/6} n)$ time, background
processes incur an additional cost of $O(\log n/\log\log n)$. Hence the cost of updates does not increase.


Theorem 2 A dynamic string $S[1, n]$ over alphabet $\Sigma = \{1, \ldots, \sigma\}$ can be stored in a data
structure that uses $nH_k + O(n \frac{\log \sigma \log\log n}{\log n}) + O(n(\log\log \sigma)^3)$ bits and answers queries access,
rank and select in time $O(\log n/\log\log n)$. Insertions and deletions of symbols are supported in
$O(\log n/\log\log n)$ time.

6 Substring Extraction 

Our representation of compressed sequences also enables us to retrieve a substring $S[i..i+\ell-1]$
of $S$. The static data structure, described in Section A.7, supports substring extraction in
$O(\log n/\log\log n + \ell/\log_\sigma n)$ time. Hence we can quickly retrieve a substring of any $S_i$. We can
also augment $S_0$ with $O((n/r) \log \sigma)$ additional bits, so that a substring of $S_0$ is extracted in
the same time. We can retrieve a substring of $S$ by extracting a substring of $S_0$ and a substring of
some $S_i$ for $i \ge 1$ and merging the result. A detailed description is provided in Section A.9. Our
result can be summed up as follows.

Theorem 3 We can augment data structures described in Theorem 1 and Theorem 2 with $O((n/r) \log \sigma)$
additional bits, so that a substring of length $\ell$ can be extracted in $O(\log n/\log\log n + \ell/\log_\sigma n)$
time. The parameter $r = O(\log n/\log\log n)$ is defined in the same way as in Theorems 1 and 2.
A.1 Colored Predecessor Queries


In this section we consider predecessor queries on a linked list, called colored predecessor queries.
The result of this section is used in the proof of Lemma 1. Suppose that each entry in an ordered
list $L$ is colored with a symbol $a \in \Sigma$ from an alphabet $\Sigma = \{1, \ldots, \sigma\}$. We will also sometimes
say that an entry $e$ contains a symbol $a$. A colored predecessor query $(e_q, a)$ for an entry $e_q \in L$
and a symbol $a \in \Sigma$ asks for the rightmost entry $e \in L$ that is colored with $a$ and precedes $e_q$. We
consider the problem of answering colored predecessor queries on a dynamic list $L$. This problem
was previously considered by Kopelowitz, who described a randomized $O(\log\log m + \log\log \sigma)$-time
solution. Mortensen described an $O(\log\log m)$ time solution for the case $\sigma = \log^c n$ and a
constant $c$. We present here a deterministic solution for an arbitrarily large alphabet. This result
is also of independent interest.

We start by describing a data structure that uses more than linear space. Then we will show 
how the space usage can be reduced to linear and how the update time can be decreased. 

Lemma 4 Let $L$ be a list with $m \le n$ entries. There exists an $O(m \log^2 m)$-bit data structure that
answers colored predecessor queries on $L$ in $O(\log\log m (\log\log \sigma)^2)$ time and supports insertions
and deletions in $O(\log m)$ time.

Proof: For a symbol $a$, let $L_a$ denote the sublist of $L$ that consists of entries containing $a$. Each
entry that contains $a$ is augmented with a pointer to the next and the previous entries in $L_a$. We
also store an order maintenance data structure on $L$. This data structure can determine in $O(1)$
time whether $e_1$ precedes $e_2$ in $L$ for two arbitrary entries $e_1 \in L$ and $e_2 \in L$ in a dynamic list $L$.
We refer to [2, 24] for a description of such a data structure.

We keep a balanced tree $T_L$ on $L$. For a node $u \in T_L$, the set $Col(u)$ consists of all symbols $a$
such that at least one leaf descendant of $u$ contains $a$. In every leaf of $T_L$, we keep pointers to all its
ancestors. For every $a \in Col(u)$, we also keep $a.\min(u)$ and $a.\max(u)$; $a.\min(u)$ (resp. $a.\max(u)$)
points to the leftmost (rightmost) element of $L$ in the subtree of $u$ colored with $a$.

Suppose that we want to find the rightmost entry $e_a \in L$ that contains a symbol $a$ and precedes
an entry $e_q \in L$. We look for the lowest ancestor $u$ of (the leaf that contains) $e_q$ such that
$a \in Col(u)$. Using binary search on $\log n$ ancestors of $e_q$, we can find $u$ in $O(\log\log n (\log\log \sigma)^2)$
time. If $e_q$ is in the right subtree of $u$, then $e_a = a.\max(u_l)$ where $u_l$ is the left child of $u$. If $e_q$ is
in the left subtree of $u$, then we find $e'_a = a.\min(u_r)$ where $u_r$ is the right child of $u$. The entry $e'_a$
is the leftmost entry colored with $a$ that follows $e_q$. Hence the entry $e_a$ is the last occurrence of $a$ in $L$ before $e'_a$.
In other words, $e_a$ immediately precedes $e'_a$ in $L_a$.

When a new element $e$ is inserted into $L$, we insert it into some leaf $l_e$ of $T_L$ and a new entry
into the corresponding list $L_a$. Insertion into $L_a$ requires that we find the rightmost entry $e_a$ that
is colored with $a$ and precedes $e$ in the list. This takes $O(\log m/\log\log n)$ time as described above.
Then we visit all ancestors of $l_e$ in $T_L$. If necessary, we add $a$ to $Col(v)$ in each visited node $v$. We
keep the tree balanced, using the algorithms of the weight-balanced B-tree [3]. The cost of maintaining
$T_L$ so that its height remains $O(\log m)$ is $O(\log m)$ per insertion. Deletions are symmetric. When
an element $e$ is deleted, we remove it from the list $L_a$ and update $Col(u)$ in at most one ancestor
of $e$. Then we remove the leaf that contains $e$. The weight-balanced B-tree is not modified after
the deletion of a leaf. But when a fraction of leaves is deleted, we construct a new tree $T_L$ and
discard the old instance of $T_L$. The process of re-building $T_L$ can be run in the background so that
the total worst-case cost of deleting $e$ is $O(\log m)$. □
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Lemma 5 Let L be a list with m < n entries. There exists an 0{m\ogrn)-bit data structure that 
answers colored predecessor queries on L in 0((log logo - ) 2 log log m) time and supports insertions 
and deletions in 0((log logo - ) 2 log log m) time. 

Proof: We divide every $L_a$ into $O(|L_a|/\log m)$ blocks so that every block contains $O(\log^2 m)$
consecutive entries of $L_a$. If $L_a$ consists of more than one block, then we maintain the list $L'_a$
that contains the first entry from every block of $L_a$. The list $L'$ contains all elements of $L'_a$ for all
symbols $a$. We keep $L'$ in the data structure described in the preceding proof. For any symbol $a$, all
elements of $L_a$ are also stored in a data structure $T_a$ that supports finger searches [IB]: for any element
$e_q \in L$ and a finger $e'_a \in L_a$, $T_a$ can return the rightmost entry $e_a$ that is colored with $a$ and
precedes $e_q$ in $O(\log d)$ comparisons, where $d$ is the number of entries between $e'_a$ and $e_a$ in $L_a$.
Finally we also keep the list $L$ in the union-split-find data structure of Mehlhorn et al. Using this data
structure, we can find the first $e' \in L'$ that precedes any $e \in L$ in $O(\log\log m)$ time. The data
structure of Mehlhorn et al. uses $O(m)$ words and supports updates in $O(\log\log m)$ time.

In order to find $e_a$ colored with symbol $a$ that precedes $e_q$, we find the first entry $e' \in L'$
that precedes $e_q$. Then we identify the first entry $e'_a$ colored with $a$ that precedes $e'$. There are
$O(\log^2 m)$ entries of $L_a$ between $e'_a$ and $e_a$. When $e'_a$ is known, we can find $e_a$ in $O(\log\log m)$
time using finger search on $T_a$. The total query time is dominated by the search in $L'$ and equals
$O(\log\log m\,(\log\log\sigma)^2)$.

When a new entry $e$ of color $a$ is inserted, we update $L$. Then we find the position of $e$ in $L_a$
and update $L_a$ and $T_a$. We can maintain the sizes of blocks in lists $L_a$ so that each block consists
of $O(\log^2 m)$ entries and there is one insertion into $L'$ for $O(\log m)$ insertions into $L$; details will be
given in the full version. Thus the total cost of an insertion is $O((\log\log\sigma)^2\log\log m)$. Deletions
are symmetric. □

A.2 Prefix Sum Queries on a List 

In this section we describe a data structure on a list L that is used in the proof of Lemma [1] in 
Section [2] 

Lemma 6 We can keep a dynamic list $L$ in an $O(m\log m)$-bit data structure $D(L)$, where $m$ is the
number of entries in $L$. $D(L)$ can find the $i$-th entry in $L$ for $1 \le i \le m$ in $O(\log m/\log\log n)$ time.
$D(L)$ can also compute the number of entries before a given element $e \in L$ in $O(\log m/\log\log n)$
time. Insertions and deletions are also supported in $O(\log m/\log\log n)$ time.

Proof: $D(L)$ is implemented as a balanced tree with node degree $\Theta(\log^{\varepsilon} n)$. In every internal node
$u$ we keep a data structure $Pref(u)$: $Pref(u)$ contains the total number $n(u_i)$ of elements stored below
every child $u_i$ of $u$. $Pref(u)$ supports prefix sum queries (i.e., computes $\sum_{i=1}^{t} n(u_i)$ for any $t$) and
finds the largest $j$ such that $\sum_{i=1}^{j} n(u_i) \le q$ for any integer $q$. We implement $Pref(u)$ as in Lemma
2.2 in [36] so that both types of queries are supported in $O(1)$ time. $Pref(u)$ uses linear space (in
the number of its elements) and can be updated in $O(1)$ time. $Pref(u)$ needs a look-up table of size
$o(n^{\varepsilon})$. To find the $i$-th entry in a list, we traverse the root-to-leaf path; in each visited node $u$ we
find the child that contains the $i$-th entry using $Pref(u)$. To find the number of entries preceding
a given entry $e$ in a list, we traverse the leaf-to-root path $\pi$ that starts in the leaf containing $e$. In
each visited node $u$ we answer a query to $Pref(u)$: if the $j$-th child $u_j$ of $u$ is on $\pi$, then we compute
$s(u) = \sum_{i=1}^{j-1} n(u_i)$ using $Pref(u)$. The total number of entries to the left of $e$ is the sum of $s(u)$
for all nodes $u$ on $\pi$. Since we spend $O(1)$ time in each visited node and the height of the tree is
$O(\log m/\log\log n)$, both types of queries are answered in $O(\log m/\log\log n)$ time. An update operation
leads to $O(\log m/\log\log n)$ updates of data structures $Pref(u)$. The tree can be re-balanced using the
weight-balanced B-tree [3], so that its height is always bounded by $O(\log m/\log\log n)$. □
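
The two traversals in this proof can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the tree is built once and never re-balanced, the per-node prefix sums are computed by scanning the child counts rather than by the constant-time $Pref(u)$ structure of [36], and the names `Node`, `build`, `find_ith` and `count_before` are hypothetical.

```python
class Node:
    def __init__(self, children=None, items=None):
        self.children = children or []
        self.items = items or []          # only leaves store entries
        self.parent = None
        self.index = 0                    # position among the parent's children
        self.counts = [c.size for c in self.children]   # n(u_i) for every child
        self.size = sum(self.counts) if self.children else len(self.items)
        for k, c in enumerate(self.children):
            c.parent, c.index = self, k


def build(entries, degree):
    """Bottom-up construction of a tree with node degree at most `degree`."""
    assert degree >= 2
    if not entries:
        return Node()
    level = [Node(items=entries[i:i + degree]) for i in range(0, len(entries), degree)]
    while len(level) > 1:
        level = [Node(children=level[i:i + degree]) for i in range(0, len(level), degree)]
    return level[0]


def find_ith(root, i):
    """Root-to-leaf walk: return (leaf, offset) of the i-th entry, 1 <= i <= m."""
    if not 1 <= i <= root.size:
        raise IndexError(i)
    u = root
    while u.children:
        for k, c in enumerate(u.counts):
            if i <= c:
                u = u.children[k]
                break
            i -= c                        # skip the entries below child k
    return u, i - 1


def count_before(leaf, offset):
    """Leaf-to-root walk: add s(u) = n(u_1) + ... + n(u_{j-1}) in every ancestor u."""
    total = offset
    u = leaf
    while u.parent is not None:
        total += sum(u.parent.counts[:u.index])
        u = u.parent
    return total
```

Both functions visit one node per level, so with node degree $\Theta(\log^{\varepsilon} n)$ and constant-time prefix sums the number of steps matches the height bound used in the proof.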

A.3 Updating Data Structure in Lemma [2] 

When the size of a chunk $C_i$ equals $2\sigma$ we start the procedure of re-building this chunk. During
the next $\sigma/2$ updates of $C_i$ we retrieve all elements of $C_i$ and insert them into data structures for
new chunks, $C'_i$ and $C''_i$. If an update is a deletion of some element $e$ and $e$ was already copied into
$C'_i$ or $C''_i$, then we remove the copy of $e$ from $C'_i$ or $C''_i$. When all elements of $C_i$ are copied into
$C'_i$ and $C''_i$, we say that $C_i$ is a copied chunk. We keep ids of all copied chunks in a data
structure $L_d$. Whenever a copied chunk $C_i$ is updated we also execute the same update of $C'_i$ or $C''_i$.

We also run the following iterative procedure that replaces copied chunks with two chunks.
Each iteration starts by finding a chunk $C_i$ in $L_d$ with the largest number of elements. Then all arrays
$B_a$ are updated in increasing order of $a$. We insert a 0-bit at an appropriate position of $B_a$ so that
$B_a = 1^{d_1}0 \ldots 1^{d_i}0 \ldots$ is changed to $B_a = 1^{d_1}0 \ldots 1^{d'_i}01^{d''_i}0 \ldots$ where $d_i$, $d'_i$ and $d''_i$
denote the number of $a$'s that occur in $C_i$, $C'_i$ and $C''_i$ respectively. We keep a variable lastsym that
equals the largest symbol $a$ such that $B_a$ is already updated. When all $B_a$ are modified in the above
manner, we also update $B_t$ and change it from $B_t = 1^{n_1}0 \ldots 1^{n_i}0 \ldots$ to
$B_t = 1^{n_1}0 \ldots 1^{n'_i}01^{n''_i}0 \ldots$ where $n_i$, $n'_i$ and $n''_i$ denote the total number of symbols in
$C_i$, $C'_i$ and $C''_i$ respectively. Finally we delete the id of $C_i$ from $L_d$, set lastsym $= 0$ and start
the next iteration. Every iteration takes $O(\sigma)$ time. When a chunk is added to $L_d$, its size does not
exceed $5\sigma/2$. Using Theorem 5 in [IT], we can show that the size of each chunk in $L_d$ grows by at most
$\sigma \cdot O(h_n)$ where $h_n = O(\log n)$ denotes the $n$-th harmonic number.

We slightly modify the method for answering a select query. Let $k$ denote the index of the last
chunk that was retrieved from $L_d$. That is, the above described iterative procedure is currently
changing bit vectors $B_a$ and $B_t$, changing $B_a = \ldots 1^{d_k}0 \ldots$ to $\ldots 1^{d'_k}01^{d''_k}0 \ldots$ and
$B_t = \ldots 1^{n_k}0 \ldots$ to $\ldots 1^{n'_k}01^{n''_k}0 \ldots$. To answer a query $select_a(i, S)$, we first find the index
$i_2$ of the chunk $C_{i_2}$ that contains the $i$-th occurrence of $a$, $i_2 = rank_0(select_1(i, B_a), B_a) + 1$.
If $i_2 \le k$ or $a >$ lastsym, we proceed as described in the proof of Lemma [2]. If $i_2 > k$ and
$a \le$ lastsym, we decrement $i_2$ by 1, $i_2 = i_2 - 1$, and also proceed as in Lemma [2].
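
As an illustration of the index computation $i_2 = rank_0(select_1(i, B_a), B_a) + 1$ and of the decrement rule, here is a small Python sketch. The bit vectors are plain lists scanned linearly, which is only meant to show the arithmetic; the helper names (`unary_encode`, `split_run`, `old_chunk_index`) are hypothetical and not taken from the paper.

```python
def unary_encode(counts):
    """B_a = 1^{d_1} 0 1^{d_2} 0 ... for the per-chunk counts d_1, d_2, ..."""
    bits = []
    for d in counts:
        bits.extend([1] * d)
        bits.append(0)
    return bits


def select1(bits, i):
    """1-based position of the i-th 1-bit."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        seen += b
        if b == 1 and seen == i:
            return pos
    raise ValueError("fewer than i one-bits")


def rank0(bits, pos):
    """Number of 0-bits among the first pos bits."""
    return pos - sum(bits[:pos])


def chunk_of_occurrence(bits, i):
    return rank0(bits, select1(bits, i)) + 1


def split_run(bits, k, d_first):
    """Replace 1^{d_k} 0 by 1^{d'_k} 0 1^{d''_k} 0 with d'_k = d_first."""
    zeros, start = 0, 0
    for idx, b in enumerate(bits):
        if zeros == k - 1:
            start = idx               # first bit of the k-th run
            break
        if b == 0:
            zeros += 1
    cut = start + d_first
    return bits[:cut] + [0] + bits[cut:]


def old_chunk_index(bits_a, i, k, a, lastsym):
    """Chunk index in the old numbering while chunk k is being replaced."""
    i2 = chunk_of_occurrence(bits_a, i)
    if i2 > k and a <= lastsym:       # B_a already contains the extra 0-bit
        i2 -= 1
    return i2
```

With counts `[2, 0, 3]` the fourth occurrence lies in chunk 3. After `split_run(bits, 3, 1)` the vector encodes `[2, 0, 1, 2]`, the raw formula yields 4, and the decrement maps it back to chunk 3.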

We also keep track of the number of chunks that contain no more than $\sigma$ elements. If there are
at least $n/(2\sigma)$ chunks containing at most $\sigma$ symbols, then we start a global re-building procedure.
We retrieve all elements of $S$ and insert them into a new data structure. In the new data structure
all elements are distributed among chunks, so that each chunk contains $\sigma$ elements. The global
re-building process is executed during $n/(4\sigma)$ updates.

A.4 Re-Building Compressed Data Structure in the Background 

As shown in Section [IJ we must bound the total number of symbols in $S_0$ by $O(n/r)$ for a parameter
$r$. We must also bound the number of symbols in $S_i$ for $i \ge 1$ that are marked as deleted by $O(n/r)$.
We run two alternating processes in the background to satisfy these requirements. In order to bound
the workspace we process sub-sequences $S_i$ one-by-one. For every $i$, $1 \le i \le r$, we produce a new
version of $S_i$ containing all relevant elements of $S_0$ (i.e., all elements of $S_0$ that precede the
first element of $S_{i+1}$ and follow the last element of $S_{i-1}$ in $S$). In order to navigate in the new
version of $S_i$, we must modify parts of auxiliary sequences (such as R, S, E, and B). Therefore
our background process also produces new versions for the relevant portions of auxiliary sequences.
When the new version of $S_i$ is created, we discard the old version; we also replace the parts of
auxiliary sequences with their new versions. The second background process removes elements
marked as deleted and updates $S_i$ in the same manner. A more detailed description follows.

We conceptually divide $S_0$ into $r$ substrings $S_{0,i}$ for $1 \le i \le r$. An element $e \in S_0$ is in $S_{0,i}$ for
$1 < i < r$ iff $e$ precedes the first element of $S_{i+1}$ in $S$ and follows the last element of $S_{i-1}$ in $S$. An
element $e \in S_0$ is in $S_{0,1}$ if $e$ precedes the first element of $S_2$; $e \in S_0$ is in $S_{0,r}$ if $e$ follows the last
element of S r - 1 . Likewise the sequence S is conceptually divided into r substrings Si,... ,S r . An 
element e € S is in S', for some i > 1 if e is a copy of some e! € 5, or e is a copy of some e! € So 
and e! € So,*- We conceptually divide the binary sequence R using the same principle: R[j] is in 
Ri if the j-th element of S is from S, or the j-th element of S is some e! € So such that e! € So,*. 
Other binary sequences are divided in the same way. The procedure for moving elements of So into 
Si for some i, 1 < i < r, is as follows. 

Step 1 We start by creating a new instance S c of Si and a new instance S c of S l ; we also create 
new instances of EC and the i-th parts of other binary sequences; namely R c , Df for all a G £ 
such that a occurs in Si, B c and E c are copies of R.,, D a ^, Bi and £) respectively. The cost of 
creating new instances for parts of auxiliary sequences can be distributed among the following 
updates of S, as will be explained below. At the end of Step 1, R c is a copy of i?.,; likewise 
Df, B c and E c are copies of D a ^, Bi and E, respectively. These newly created sequences will 
be called copy sequences. 

Step 2 Then we insert the elements of So,, at appropriate positions of Sf. We modify the sequence 
S c accordingly. Changes in S c and Sf also lead to changes in copy sequences R c , D c a , B c and 
E c . We distribute the cost of Step 2 among updates of S. We will say that all elements that 
are kept in Sq (resp. in Sf) upon completion of Step 1 are old elements. When a sequence 
S is updated, we spend 0(logn/loglogn) time on the following actions: (i) we find the next 
unprocessed element e n in Sq (symbols in Sf are processed in the left-to-right order); we set 
the bit corresponding to e n in R c to 1 (ii) we insert e n at appropriate position of Sf (iii) if 
necessary, we update S c ; copy sequences B c and E c are updated accordingly. We may also 
need to update copy sequences after an update of S. If the update of S is an insertion, and 
a new element e is inserted into So,,, then we also insert e into Sf. If an element e is deleted 
and e € .Sq.,;, then we remove the copy of e from Sq; changes in Sf can also lead to changes 
in S c . If a symbol a is deleted from Si, then we update Df accordingly. 

Step 3 When Sf is completed, we discard old Si, set Si = Sf, and start using the new Si from 
now on. Simultaneously we replace the relevant section of S with S c . We also replace the 
relevant parts of R, D a , B and E with R c , Df, B c and E c . 

In order to execute the above background process, we must implement binary sequences so that
two additional procedures are supported. A binary sequence of length $m$ is divided into $r$ sectors
(substrings) of length $O(m/r)$ each. We can produce a copy of each sector. The cost of producing
a copy is distributed among $\frac{m\log\log n}{r\log^2 n}$ updates; when the procedure is finished, the sector and its
copy are equal. We can perform updates on the original sequence and on a sector copy. We can also
replace a sector with its copy and discard the original sector. The same procedures are also supported
for the non-binary sequence S. We can implement these procedures in such a way that the cost
of rank, select, access, and updates is not increased. The implementation of the auxiliary procedures is
explained in Section A.5.

Step 1 of the above process takes $O(n/r)$ time. Step 2 (insertion of new elements into $S^c_i$)
takes $O(\nu_i(\log n/\log\log n))$ time, where $\nu_i$ is the number of elements inserted into $S^c_i$. Step 3
takes $O(\log n/\log\log n)$ time. Thus old elements of $S_0$ are moved to $S_i$ for $i \ge 1$ in
$O(n) + \sum_i \nu_i(\log n/\log\log n) = O(n)$ time. This process can be distributed among $n/(4r)$ updates.

The process of purging the sequences $S_1, \ldots, S_r$ is based on the same approach. For each
$i = 1, \ldots, r$, we create a new instance of $S_i$ without deleted elements; then we discard the old
instance and start using the new version of $S_i$. Relevant parts of S and binary sequences are also
updated. The re-building of $S_i$ is implemented in the same way as in the procedure of moving
elements from $S_0$ to $S_i$ for $i \ge 1$. The cost of purging $S_i$ is distributed among the following $n/(4r)$
updates. The two background processes described above are run alternately; the first process starts
when either the number of elements in $S_0$ or the number of elements marked as deleted is equal
to $n/(4r)$. In this way we guarantee that the number of elements in $S_0$ and the number of deleted
elements do not exceed $n/r$.

A.5 Auxiliary Procedures for Binary and Non-Binary Sequences 

In this section we show how a sequence $S$ can be stored in such a way that additional processes
that create a copy of a part of $S$ are supported. Furthermore we can update the copied part and
later replace the original part with its modified copy. We start by describing a binary sequence
that supports an additional operation $init(S, m)$; $init(S, m)$ initializes a sequence of length
$m$ that consists of $m$ 0-bits. Recall that $\lambda = \log n/\log\log n$.

Lemma 7 A binary sequence $S$ that supports $rank_1(i, S)$, $select_1(i, S)$, $access(i, S)$, insertions,
deletions and $init(S, m)$ for any $m \le n$ can be stored in $O(s\log\frac{n}{s}) + o(n)$ bits, where $s$ is the
number of 1-bits and $n$ is the length of the sequence. All operations except for $init(S, m)$ take $O(\lambda)$
time; $init(S, m)$ can be executed in $O(1)$ time.

Proof: We divide the sequence $S$ into blocks $B_i$ such that each $B_i$ consists of $O(\log^2 n)$ bits. Each
block is further divided into sub-blocks of $\Theta(\log^{1/2} n)$ bits. We will say that a block or a sub-block
is non-empty if it contains at least one 1-bit. A doubly-linked list $L$ contains one entry for
each non-empty block. We also keep a list $L_i$ for every block $B_i$ that contains 1-bits; $L_i$ contains
one entry for each non-empty sub-block. For each entry of $L$ we keep the number of 1's in the
corresponding block $B_j$; we also keep the total number of bits in blocks $B_{j'+1}, B_{j'+2}, \ldots, B_{j-1}$, where
$B_{j'}$ is the rightmost non-empty block that precedes $B_j$. We maintain a data structure that enables
us to find the block that contains the $i$-th bit in the sequence. We also maintain a data structure
that can find the block containing the $i$-th 1-bit (or 0-bit) and the number of 1-bits (0-bits) that
precede a specified block. We maintain the same data structure for each sub-block. All these data
structures are implemented as balanced trees with node degree $\log^{\varepsilon} n$ for a small constant $\varepsilon > 0$.
Each node is augmented with additional information about the number of 1-bits (resp. the total
number of bits) in the subtrees of its children. Implementation is the same as for data structures
$D(L)$ and $D(L_a)$ in Lemma [1].

Positions of 1-bits in the same sub-block are difference coded: for every 1-bit we store the
difference between its position and the position of the preceding 1-bit in the same block; for the
first 1-bit in the block, we store its position in the block. The list $L$ and its data structures can be
kept in $O(n/\log n)$ bits. All lists $L_i$ and their data structures are kept in $O(n\log\log n/\sqrt{\log n})$
bits. Difference coding of 1-bits in all blocks consumes $O(s\log\frac{n}{s})$ bits.

To answer a query $rank_1(i, S)$ we find the block $B_k$ and its sub-block $B_{k,j}$ containing the $i$-th bit.
Then we find the number of 1-bits that precede $B_k$ in $L$ and the number of 1-bits that precede $B_{k,j}$
in $L_k$. We can find the number of 1-bits that precede the bit with global position $i$ in $B_{k,j}$ using a
look-up table. Summing the three values above, we obtain $rank_1(i, S)$. Queries $select_1(i, S)$, $rank_0(i, S)$,
and $select_0(i, S)$ are computed in a similar way. Thus all queries are answered in $O(\log n/\log\log n)$
time.

Since we only keep non-empty blocks and sub-blocks, operation $init(S, m)$ takes constant time.
Insertions and deletions are implemented as in previously known data structures supporting rank
and select on binary sequences. When an element is inserted, we find its block $B_i$ and its sub-block;
we insert the new element into its sub-block and update lists $L$ and $L_i$ if necessary. We maintain
sizes of blocks and sub-blocks using standard techniques. Deletions are symmetric. Hence insertions
and deletions are supported in $O(\log n/\log\log n)$ time. □
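
The difference coding used in this proof can be illustrated with a short Python sketch. It only shows the encoding of the 1-bit positions inside one block and a linear-scan rank on the encoded form; the actual structure answers such queries with a look-up table, and the function names are hypothetical.

```python
def gap_encode(positions):
    """Sorted 0-based positions of 1-bits -> first position, then differences."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)     # for the first 1-bit this is its position
        prev = p
    return gaps


def gap_decode(gaps):
    """Inverse of gap_encode: running sums give the positions back."""
    positions, cur = [], 0
    for g in gaps:
        cur += g
        positions.append(cur)
    return positions


def rank1(gaps, i):
    """Number of 1-bits among the first i bits (positions 0 .. i-1)."""
    count, cur = 0, 0
    for g in gaps:
        cur += g
        if cur >= i:
            break
        count += 1
    return count
```

A block with 1-bits at positions 3, 4, 9 and 20 is stored as the gaps 3, 1, 5, 11, so sparse blocks need only a few small numbers, and an all-zero block needs nothing at all.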

Now we describe how a copy of a binary sequence $S$ can be created. Let $\lambda = \log n/\log\log n$.

Lemma 8 Let $S$ be a binary sequence of length $s$. Procedure copy(), that produces a copy of $S$,
can be implemented as a background process that runs during $O(s/\log n)$ consecutive updates. We
can support updates on the original sequence and its copy in $O(\lambda)$ time. Operations rank, select,
and access are executed in $O(\lambda)$ time. The underlying data structure uses $sH_0(S) + s + o(s)$ bits.

Proof: The procedure for creating a copy $S'$ of $S$ consists of two stages. During the first stage we
produce a copy of $S$. $S$ is represented in the same way as in [35]. As described in [35], $S$ is split
into chunks and we maintain data structures that support counting the number of 0-bits (resp.
1-bits) among the chunks and searching for the chunk that contains the $i$-th 0-bit (or 1-bit). We
can create a copy by copying the original sequence of chunks. The data structure that supports
counting and searching among chunks is essentially a tree with $O(s')$ nodes; we can create this tree
in $O(s')$ time, where $s' = O(s/\log n)$ is the number of chunks.

Thus the background process that creates a copy of $S$ takes $O(s/\log n)$ time. We can distribute
its cost among $O(s/(\lambda\log n))$ updates where $\lambda = \log n/\log\log n$. We keep information about these
updates in four data structures. The data structure $U$ keeps information about positions of updates:
the $i$-th 1-bit in a sequence $U$ is the position of the $i$-th update (insertion or deletion) in $S$. Thus
$U$ contains one bit for every element in $S$ and one bit for every element that was deleted from $S$.
Updates are counted in the left-to-right order and $U$ is implemented as in Lemma [7]. We also keep
a bit sequence $T$ which indicates the type of updates on $S$: $T[i] = 0$ if the $i$-th update stored in $U$
is a deletion and $T[i] = 1$ if the $i$-th update is an insertion. The sequence $B_n$ contains the values
of elements inserted into $S$. A sequence $U_d$ helps us navigate between $S$ and $U$; $U_d$ contains one
bit for every element in $S$ and one bit for every element that was deleted from $S$. If $U_d[j] = 1$,
then the corresponding element was already deleted from $S$; if $U_d[j] = 0$, then the corresponding
element is in $S$. During each update, we perform the following operations:

• a new element is inserted into or deleted from $S$ at position $i$

• Let $i' = select_0(i, U_d)$ be the position of the $i$-th 0-bit in $U_d$. If the update of $S$ is an insertion,
we insert a 1-bit into $U_d$ at position $i'$. If the update is a deletion, we replace the $i'$-th bit in
S with a 0-bit (replacement is implemented by deleting a 1-bit and inserting a 0-bit at the
same position).

• If the update of $S$ is an insertion, we insert a 1-bit at position $i'$ into $U$; if the update is a
deletion we replace the $i'$-th bit of $U$ with a 1-bit. We also insert a bit indicating the type of
update into the sequence $T$. If an update is an insertion, we add the value of the new bit to
the bit sequence $B_n$.

• we spend $O(\lambda)$ time on constructing a copy sequence.

The first stage is finished after $O(s/(\lambda\log n))$ updates of $S$. When the first stage is completed,
$S$ and its copy sequence $S'$ differ because the $O(s/(\lambda\log n))$ most recent updates changed the original
sequence but were not performed on its copy. During the second stage we synchronize $S$ and $S'$.
The synchronisation procedure is also distributed among $O(s/(\lambda\log n))$ updates. During every update
operation, we proceed as follows:

- a new element is inserted into $S$ or deleted from $S$. We also change the copy sequence $S'$ accordingly.
If the position $i$ of an element in $S$ is known, then we can find its position $i_n$ in $S'$ using
sequences $U_d$, $U$ and $T$. Using $U_d$, we find the position $i_d = select_0(i, U_d)$ corresponding to $i$ in
$U$. Using $U$, we find the number $u$ of updates that precede $i_d$; using $T$ we can find the number of
insertions and deletions among the first $u$ updates.

- we also execute updates stored in sequences $U$, $T$, and $B_n$. We retrieve the position $i =
select_1(1, U)$ of the first 1-bit stored in $U$ and find the position $i'$ in $S'$ that corresponds to the
position $i$ in $S$. Then we either insert a new element at the position $i'$ or remove the $i'$-th element
from $S'$ according to the data stored in $T$ and $B_n$. Finally we delete the $i$-th bit from $U$ and $U_d$.
We also delete the corresponding bit from $T$ and remove the corresponding symbol from $B_n$ (if the
processed update is an insertion).

At the end of the second stage $S'$ and $S$ are equivalent. □

Now we consider the sequence $S$ of length $s$ that is divided into $r$ contiguous parts for $r =
\log^{O(1)} n$. Each part, called a sector of $S$, consists of $O(s/r)$ elements. The procedure copysector()
creates a copy of an arbitrary sector. The procedure copysector() can be executed in the background
during a sequence of $s/(r\lambda)$ updates. Furthermore we can split a sector into two sectors and merge
two adjacent sectors in the same time. Last, we can also replace a sector with its copy in $O(r)$
time. Update operations are supported on both $S$ itself and on the copy of a sector (we assume
that at any time a copy of only one sector is created or used).

Lemma 9 Let $S$ be a binary sequence of length $s \le n$ and let $S$ be divided into $r$ sectors of
$O(s/r)$ symbols. Procedure copysector(), that produces a copy of a sector, can be implemented as a
background process that runs during $O(s/(r\lambda\log n))$ consecutive updates. Procedures splitsector()
and mergesectors() can be executed in the same way. Operation replacesector() can be executed
in $O(\log\log n)$ time. The underlying data structure uses $sH_0(S) + o(n)$ bits.

Proof: Every sector is maintained in the data structure of Lemma 8. Furthermore we maintain a
sequence $G$ that keeps the numbers of elements in every sector. Sequences $G_0$ and $G_1$ maintain the
number of 0's and 1's in every sector. Using $G$ and data structures for individual sectors, we can
answer rank, select, and access queries on $S$. Procedure copysector() is implemented as copy() on a
sector of $S$. When a copy of a sector is ready, we can support updates on this copy. Besides we can
also replace a sector with its copy and update the data structure on the sequence $G$ accordingly;
this operation takes $O(\log\log n)$ time. Splitting and merging of sectors is implemented in a similar
way. Suppose that we want to split a sector $S_i$ into $S'_i$ and $S''_i$. We employ the same two-stage
procedure that was used to create a copy of a sector. During the first stage we assign elements of
$S_i$ to $S'_i$ and $S''_i$. Then we create the data structures for $S'_i$ and $S''_i$. Updates that are relevant for
new versions are deposited in data structures $U_1$, $T_1$, $B_1$ and $U_2$, $T_2$, $B_2$ respectively. During the
second stage we execute updates stored in $U_i$, $T_i$, and $B_i$ for $i = 1, 2$. Auxiliary data structures are
realized in the same way as in the procedure copysector(). When new sectors $S'_i$ and $S''_i$ are ready,
we replace $S_i$ with $S'_i$ and $S''_i$ and update $G$. □

We can implement similar procedures for a sequence over a general alphabet. We assume, 
however, that copies of sectors are produced consecutively: first copy of the first sector is created; 
when the first sector is replaced with a (possibly modified) copy sector, we create the copy of the 
next sector, etc. In this scenario, it is easy to maintain the dynamic sequence that supports the 
copying, splitting and merging of sectors. 

Lemma 10 Let $S$ be a sequence of length $s \le n$ over an alphabet $\Sigma = \{1, \ldots, \sigma\}$ and let $S$ be divided
into $r$ sectors. Procedure copynextsector(), that produces a copy of a sector, can be implemented as
a background process that runs during $O(s/(r\lambda))$ consecutive updates. Procedures splitnextsector()
and mergesectors() can be executed in the same way. Operation replacenextsector() can be executed
in $O(\lambda)$ time. The data structure for $S$ uses $O(s\log\sigma)$ bits.

Proof: Elements of $S$ are distributed among two dynamic data structures, $S_{old}$ and $S_{new}$. Both of
them are implemented as in Lemma [2]. Originally $S_{new}$ is empty and all elements of $S$ are in $S_{old}$.
Procedure copynextsector() traverses elements of the next sector and appends them at the end of
the new sequence $S_{new}$. When replacenextsector() is executed for the last (rightmost) sector, we
set $S_{old} = S_{new}$ and $S_{new} = \emptyset$. Let $i_p$ denote the number of elements in all sectors of $S_{old}$ for which
operation replacenextsector() was executed. Let $i_m$ denote the total number of elements currently
kept in $S_{new}$. We can answer $access(i, S)$ by retrieving $S_{new}[i]$ if $i \le i_m$ or retrieving $S_{old}[i - i_m + i_p]$
if $i > i_m$. We can answer a query $rank_a(i, S)$ as follows. If $i \le i_m$, $rank_a(i, S) = rank_a(i, S_{new})$.
If $i > i_m$, we must count all occurrences in $S_{new}$ and discard the first $i_p$ positions of $S_{old}$, so that
$rank_a(i, S) = rank_a(i_m, S_{new}) + rank_a(i - i_m + i_p, S_{old}) - rank_a(i_p, S_{old})$. To answer a query
$select_a(i, S)$ we check whether $rank_a(i_m, S_{new}) \ge i$. If this condition is satisfied, then
$select_a(i, S) = select_a(i, S_{new})$; otherwise we find the position $select_a(i', S_{old})$ in $S_{old}$, where
$i' = i - rank_a(i_m, S_{new}) + rank_a(i_p, S_{old})$, and translate it back to $S$, that is,
$select_a(i, S) = select_a(i', S_{old}) - i_p + i_m$. □
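
A minimal Python sketch of how queries can be routed between $S_{new}$ and $S_{old}$ is given below. It rests on assumptions that go beyond the text: $S_{new}$ is taken to hold exactly the sectors that were already replaced, so that logically $S = S_{new} \cdot S_{old}[i_p + 1..]$; both parts are plain Python lists scanned linearly; and for positions beyond $i_m$ the offsets contributed by $S_{new}$ and by the replaced prefix of $S_{old}$ are written out explicitly. The class name `TwoPartSequence` is hypothetical.

```python
class TwoPartSequence:
    """Logical sequence S = s_new followed by s_old[i_p:], positions are 1-based."""

    def __init__(self, s_new, s_old, i_p):
        self.s_new = list(s_new)
        self.s_old = list(s_old)
        self.i_p = i_p                      # replaced prefix of s_old

    @property
    def i_m(self):
        return len(self.s_new)

    @staticmethod
    def _rank(seq, a, i):
        return seq[:i].count(a)

    @staticmethod
    def _select(seq, a, i):
        seen = 0
        for pos, x in enumerate(seq, start=1):
            if x == a:
                seen += 1
                if seen == i:
                    return pos
        raise ValueError("fewer than i occurrences")

    def access(self, i):
        if i <= self.i_m:
            return self.s_new[i - 1]
        return self.s_old[i - self.i_m + self.i_p - 1]

    def rank(self, a, i):
        if i <= self.i_m:
            return self._rank(self.s_new, a, i)
        return (self._rank(self.s_new, a, self.i_m)
                + self._rank(self.s_old, a, i - self.i_m + self.i_p)
                - self._rank(self.s_old, a, self.i_p))

    def select(self, a, i):
        in_new = self._rank(self.s_new, a, self.i_m)
        if in_new >= i:
            return self._select(self.s_new, a, i)
        i_prime = i - in_new + self._rank(self.s_old, a, self.i_p)
        # translate the position in s_old back to a position in S
        return self._select(self.s_old, a, i_prime) - self.i_p + self.i_m
```

In the dynamic setting the two lists would be replaced by the structures of Lemma [2]; the routing logic itself stays the same.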

A. 6 Analysis 

We show that deleting $n/r$ symbols from a sequence $S$ does not increase the $k$-th order entropy
by too much. This result is needed in Section [5] to prove the space bound of
$nH_k + o(n\log\sigma) + O(n(\log n)/r)$. Let $S = S[1] \ldots S[n]$. Let $S_0$ denote the subsequence of symbols that
are deleted from $S$ and let $S_n = S \setminus S_0$. Let $|S_0| = n/r$ for a parameter $r$. We want to estimate
$|S_n|H_k(S_n) - |S|H_k(S)$ for some parameter $k \le \alpha\log_\sigma n - 1$ and $\alpha < 1$.

A context $c_i$ is an arbitrary sequence of length $k$ over the alphabet $\Sigma$; let $f_{a,i}$ denote the number
of times a symbol $a$ is preceded by context $c_i$ in $S$ and $n_i = \sum_{a\in\Sigma} f_{a,i}$. The $k$-th order empirical
entropy is defined as $H_k(S) = \frac{1}{n}\sum_{c_i\in\Sigma^k}\sum_{a\in\Sigma} f_{a,i}\log\frac{n_i}{f_{a,i}}$.

For a context $c_i$, let $n_i$ be the number of times it occurs in $S$ and let $n'_i$ be the number of times
it occurs in $S \setminus S_0$. Suppose that a symbol $S[i]$ is deleted. It changes the context for the next $k$
symbols $S[i+1], \ldots, S[i+k]$. We will say that one deletion spoils $k$ symbols and moves them to a
different context. If a symbol $S[l]$ is spoiled and the context of $S[l]$ in $S_n$ is $c_i$, then $S[l]$ is encoded
with at most $\log(n'_i)$ bits. Let $\rho_i = n'_i - n_i$ be the number of new symbols in the context $c_i$. Let $f'_{a,i}$
be the frequency of a new symbol $a$ in $c_i$ (that is, the number of times a spoiled symbol $a$ appears
in the context $c_i$ in $S_n$). Then the total encoding length of spoiled symbols in the context $c_i$ does
not exceed $\sum_a f'_{a,i}\log\frac{n'_i}{f'_{a,i}}$ where $\sum_a f'_{a,i} = \rho_i$. By Jensen's inequality,
$\sum_a f'_{a,i}\log\frac{n'_i}{f'_{a,i}} \le \rho_i\log\frac{\sigma n'_i}{\rho_i}$.
Summing over all contexts $c_i$, the total encoding length of spoiled symbols can be bounded by
$\sum_i \rho_i(\log\frac{n'_i}{\rho_i} + \log\sigma)$.

The total number $m$ of symbols that are spoiled is between $n/r$ and $(n/r)k - 1$ because each
deletion spoils between 1 and $k$ following symbols. The number of spoiled symbols does not exceed
$n$ independently of $r$ and $k$. Hence $\sum_i \rho_i \le (n/r)\log_\sigma n$. Besides $\sum_i \rho_i \le \sum_i n'_i \le n$. Therefore
$\sum_i \rho_i\log\frac{n'_i}{\rho_i} = O(n)$. To prove the latter fact, we divide all contexts $c_i$ into classes
$L_1, L_2, \ldots, L_{\log^* n}$. $L_i$ contains all context indices $l$ such that $f_{i-1}(n) > \frac{n'_l}{\rho_l} \ge f_i(n)$, where
$f_0(n) = n$ and $f_i(n) = (\log^{(i)} n)^2$ for $i \ge 1$. For any $L_i$,
$\sum_{l\in L_i} \rho_l \le O\left(\frac{n}{(\log^{(i)} n)^2}\right)$. Hence

$\sum_l \rho_l\log\frac{n'_l}{\rho_l} = \sum_{i=1}^{\log^* n}\sum_{l\in L_i} \rho_l\log\frac{n'_l}{\rho_l} \le n\sum_{i=1}^{\log^* n} O\left(\frac{1}{(\log^{(i)} n)^2}\right)\log(f_{i-1}(n)) = n\sum_{i=1}^{\log^* n} O\left(\frac{1}{\log^{(i)} n}\right)$.

Hence $\sum_l \rho_l\log\frac{n'_l}{\rho_l} = O(n)$ because $\sum_{i=1}^{\log^* n}\frac{1}{\log^{(i)} n} = O(1)$. Thus the total encoding
length of all spoiled symbols is bounded by $E_{add} = O(n(1 + \frac{\log n}{r}))$.

Another factor that may increase the encoding length is that spoiled symbols are moved to new
contexts and thus the encoding length of all other symbols in these new contexts slightly increases.
Consider a context $c_i$ that occurred $n_i$ times in $S$ and $n_i + f_i$ times in $S \setminus S_0$ for some $f_i > 0$. We
say that a symbol $S[l] = a$ that follows $c_i$ in $S \setminus S_0$ is an old occurrence if this occurrence of $a$ is
also preceded by $c_i$ in $S$. Let $occ_{a,i}$ denote the number of old occurrences of $a$ in the context $c_i$.
The encoding length for all old occurrences in $S_n$ is $\sum_a occ_{a,i}\log\frac{n_i + f_i}{occ_{a,i}}$.
The total encoding length for the same occurrences in $S$ is $\sum_a occ_{a,i}\log\frac{n_i}{occ_{a,i}}$. The difference
between encoding lengths of old occurrences in $S$ and $S_n$ is $inc(c_i) = n_i\log\frac{n_i + f_i}{n_i}$. If $f_i \le n_i$,
then $inc(c_i) \le n_i$. Summing up the differences over all contexts $c_i$ such that $f_i \le n_i$, we obtain
$E_1 \le \sum_i n_i \le n$. If $f_i > n_i$, then $inc(c_i) \le n_i(\log\frac{f_i}{n_i} + 1)$. Summing up over all contexts $c_i$ such
that $f_i > n_i$, we get $E_2 = \sum_i n_i(\log\frac{f_i}{n_i} + 1)$. Since $\sum_i n_i < \sum_i f_i \le n$, we have $\sum_i n_i \le n$ and
$\sum_i n_i\log\frac{f_i}{n_i} \le n$. Hence $E_2 = O(n)$.
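
The two cases for $inc(c_i)$ can be written out step by step. The derivation below assumes logarithms to base 2 and uses only the quantities $n_i$ and $f_i$ introduced above; the last line relies on $\log x \le x$ for $x > 0$ and on the bound $\sum_i f_i \le n$ stated in the text.

```latex
\begin{align*}
inc(c_i) &= n_i \log\frac{n_i + f_i}{n_i}, \\
f_i \le n_i \;&\Rightarrow\; inc(c_i) \le n_i \log 2 = n_i, \\
f_i > n_i \;&\Rightarrow\; inc(c_i) < n_i \log\frac{2 f_i}{n_i}
            = n_i\Bigl(\log\frac{f_i}{n_i} + 1\Bigr), \\
\sum_{i:\, f_i > n_i} n_i \log\frac{f_i}{n_i}
  &\le \sum_{i:\, f_i > n_i} n_i \cdot \frac{f_i}{n_i}
   \le \sum_i f_i \le n .
\end{align*}
```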

Thus $|S_n|H_k(S_n) - |S|H_k(S) \le E_{add} + E_1 + E_2 = O(n(1 + \frac{\log n}{r}))$. We must also account for
elements that are marked as deleted, but are still stored in sequences $S_i$ for $i \ge 1$. The number of
elements that are marked as deleted is bounded by $O(n/r)$. These elements need $O(n\frac{\log\sigma}{r})$ bits.
Every deleted element spoils up to $k$ symbols of $S$. Using the same analysis as above, the extra
encoding length due to spoiled symbols can be estimated to be $O(n(1 + \frac{\log n}{r}))$. Thus all static
sequences $S_i$ for $i \ge 1$ are stored in $nH_k + O(n(1 + \frac{\log n}{r}))$ bits.


A.7 Static Data Structure 

In this section we describe a static data structure supporting access, rank and select queries. In
comparison to previous static data structures, we obtain two additional results. Our data structure
can be constructed quickly if the alphabet size $\sigma$ is small. At the same time we show that our data
structure supports extraction of a substring of length $\ell$ in optimal $O(\log n/\log\log n + \ell/\log_\sigma n)$
time. As before, let $S$ denote a sequence of length $n$ over an alphabet $\Sigma = \{1, \ldots, \sigma\}$.

Our static representation keeps the sequence S in compressed form following the approach ofim 
S is represented as a sequence S M of meta-symbols over an alphabet for £ = \ los 9 g n ]. That is, 
each meta-symbol encodes £ symbols of the original sequence. It is shown in [13] that Hq(S ai ) < 
Hk{S) + (n/£)kloga simultaneously for all k < £. We can keep S AI in ( n/£)(Ho(S M ) + 0(1)) bits 
using e.g. Huffman coding. 

Data Structure for Rank and Select Queries. We split $S$ into blocks of size $\sigma$. For every
$a \in [1, \sigma]$, we keep a binary sequence $B_a = 1^{s_1} 0 1^{s_2} 0 \ldots 1^{s_m} 0$, where $m$ is the number of
blocks and $s_i$ denotes the number of occurrences of $a$ in the $i$-th block. It was shown in [5] how a query
$\mathrm{rank}_a(i, S)$ or $\mathrm{select}_a(i, S)$ can be reduced to $O(1)$ rank and select queries on a block $C$ and
$O(1)$ queries on $B_a$. The data structure for a block $C$ is as follows. We keep a bit vector
$V = 1^{n_1} 0 1^{n_2} 0 \ldots 1^{n_\sigma}$ where $n_a$ is the number of times $a$ occurs in $C$. Let $\pi(i)$ denote the
position of $C[i]$ in the stable sorted ordering. That is, $\pi$ is the permutation of $C$ obtained by stably
sorting the symbols of $C$. Let $\pi^{-1}$ denote the inverse of $\pi$. Then $\mathrm{select}_a(i, C) = \pi^{-1}(j)$ where
$j = (\sum_{g=1}^{a-1} n_g) + i$. We can find $\sum_{g=1}^{p} n_g$ for any $p \le \sigma$ by answering one rank and one
select query on $V$.
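
The reduction of select inside a block to $V$ and $\pi^{-1}$ can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, rank and select on $V$ are answered by naive scans rather than by a succinct structure, and $\pi^{-1}$ is stored explicitly as an array instead of being computed as described later in this section.

```python
def build_block(C, sigma):
    """Return (V, pi_inv) for a block C over {1..sigma}."""
    counts = [0] * (sigma + 1)
    for a in C:
        counts[a] += 1
    V = []
    for a in range(1, sigma + 1):        # V = 1^{n_1} 0 1^{n_2} 0 ... 1^{n_sigma}
        V.extend([1] * counts[a])
        if a < sigma:
            V.append(0)
    # sorted() is stable: pi_inv[j] is the 0-based position in C of the
    # element that lands at position j after stable sorting.
    pi_inv = sorted(range(len(C)), key=lambda j: C[j])
    return V, pi_inv

def select0(V, p):
    """1-based position of the p-th 0 in V (naive scan)."""
    seen = 0
    for pos, bit in enumerate(V, start=1):
        if bit == 0:
            seen += 1
            if seen == p:
                return pos
    raise ValueError("V has fewer than p zeros")

def prefix_count(V, p):
    """n_1 + ... + n_p for p < sigma, via one select and one rank on V."""
    if p == 0:
        return 0
    return sum(V[:select0(V, p)])        # rank_1 up to the p-th 0

def select_in_block(V, pi_inv, a, i):
    """1-based position of the i-th occurrence of a in C, assuming 1 <= i <= n_a."""
    j = prefix_count(V, a - 1) + i
    return pi_inv[j - 1] + 1

V, pi_inv = build_block([2, 1, 2, 3, 1], 3)
second_two = select_in_block(V, pi_inv, 2, 2)    # position of the second 2
```

For the toy block 2, 1, 2, 3, 1 the vector $V$ is 1101101 and the second occurrence of symbol 2 is found at position 3.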

Let $t = \log\sigma/(\log\log\sigma)^3$. For every symbol $a$, $1 \le a \le \sigma$, the set $F_a$ contains every $t$-th
occurrence of $a$ in $C$; that is, $F_a$ contains all $j$ such that $C[j] = a$ and $\mathrm{rank}_a(j, C) = t \cdot i$ for some
integer $i$. We keep a y-trie data structure on $F_a$, so that for any $q$ we can find the largest $j \in F_a$
satisfying $j \le q$. Furthermore we store values of $\mathrm{rank}_a(j, C)$ for all $j \in F_a$. For each symbol
$C[j]$, we also keep $R[j] = (\mathrm{rank}_{C[j]}(j, C) \bmod t)$. We need $\sigma \log t$ bits to store the array $R$ and
$O((\sigma/t) \log\sigma) = O(\sigma(\log\log\sigma)^3)$ bits to store all sets $F_a$. Hence the total space usage is $o(\sigma\log\sigma)$.
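
A small sketch may help to see how the sampled sets $F_a$ and the array $R$ fit together. It is an illustration under assumptions, not the structure of the paper: the names `build_samples` and `partial_rank` are hypothetical, a sorted list with binary search stands in for the y-trie, and the query recovers the rank of the symbol stored at position $i$ as the rank of the nearest sampled position to its left plus $R[i]$.

```python
import bisect

def build_samples(C, t):
    """Return (F, R): F[a] holds every t-th occurrence of a with its rank,
    R[j-1] = rank_{C[j]}(j, C) mod t for 1-based positions j."""
    seen = {}
    F = {}
    R = []
    for pos, a in enumerate(C, start=1):
        r = seen.get(a, 0) + 1           # rank of this occurrence of a
        seen[a] = r
        R.append(r % t)
        if r % t == 0:                   # every t-th occurrence is sampled
            positions, ranks = F.setdefault(a, ([], []))
            positions.append(pos)
            ranks.append(r)
    return F, R

def partial_rank(C, F, R, i):
    """Number of occurrences of C[i] among C[1..i] (1-based i)."""
    a = C[i - 1]
    positions, ranks = F.get(a, ([], []))
    k = bisect.bisect_right(positions, i)    # predecessor search, stand-in for the y-trie
    base = ranks[k - 1] if k > 0 else 0      # no sample to the left: fewer than t occurrences
    return base + R[i - 1]

C = [1, 2, 1, 1, 2, 1, 1]
F, R = build_samples(C, 2)
```

With $t = 2$ the positions 3 and 6 are sampled for symbol 1, and the rank of the occurrence at position 7 is obtained as 4 + 1 = 5.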

Let $\mathrm{rank}'_a(i, S)$ denote the partial rank query: if $S[i] = a$, $\mathrm{rank}'_a(i, S) = \mathrm{rank}_a(i, S)$; otherwise
$\mathrm{rank}'_a(i, S)$ is undefined. If $C[i] = a$, $\mathrm{rank}'_a(i, C) = R[i] + \mathrm{rank}_a(j, C)$ where $j$ is the largest position
in $F_a$ such that $j \le i$. Since $j$ can be found in $O(\log\log\sigma)$ time, $\mathrm{rank}'_a(i, C)$ can be computed in
$O(\log\log\sigma)$ time. We can compute $\pi(i)$ as follows. If $C[i] = a$, then $\pi(i) = (\sum_{j=1}^{a-1} n_j) + \mathrm{rank}'_a(i, C)$.
Since $\mathrm{rank}'$ can be computed in $O(\log\log\sigma)$ time, we can find $\pi(i)$ for any $i$ in $O(\log\log\sigma)$ time.
Using the data structure of [32], we can compute $\pi^{-1}(i)$ in $O(t \cdot f(\sigma))$ time using $O(\sigma\log\sigma/t)$
additional bits, where $f(\sigma)$ is the time needed to compute $\pi(i)$. This data structure works as
follows: We decompose the permutation $\pi = \pi(1), \pi(2), \ldots, \pi(\sigma)$ into cycles. A cycle is the shortest
subsequence $i_1, \ldots, i_s$ of $\pi$ such that $\pi(i_j) = i_{j+1}$ for $1 \le j < s$ and $\pi(i_s) = i_1$. For every cycle of
length $s > t$, we select every $t$-th element and mark it. We keep the value of $\pi^{-t}$ for the marked
elements, where $\pi^{-t}$ denotes the inverse of $\pi$ iterated $t$ times. In order to find $\pi^{-1}(i)$, we compute
$\pi(i)$, $\pi^2(i) = \pi(\pi(i))$, $\pi^3(i)$, $\ldots$ until we reach a marked position $i_m$ or $\pi^k(i) = i$ for some $k$. If
$\pi^k(i) = i$, then $\pi^{-1}(i) = \pi^{k-1}(i)$. If we reached a marked position $i_m$, we compute $i' = \pi^{-t}(i_m)$.
Then we identify $\pi(i')$, $\pi^2(i')$, $\ldots$ until $\pi^l(i') = i$. Clearly $\pi^{-1}(i) = \pi^{l-1}(i')$ in this case. It is easy
to check that we must compute $\pi$ at most $O(t)$ times; details can be found in [32]. Thus $\pi^{-1}(i)$ is
computed in $O(t \cdot \log\log\sigma)$ time. We already showed how to answer a select query using $\pi^{-1}$. Hence
$\mathrm{select}_a(i, C)$ is also answered in $O(t \log\log\sigma)$ time. To answer a rank query $\mathrm{rank}_a(i, C)$, we first
find the largest $j \in F_a$ such that $j \le i$. If $\mathrm{rank}_a(j, C) = s \cdot t$, then $st \le \mathrm{rank}_a(i, C) < (s+1)t$. We
can find the exact value of $\mathrm{rank}_a(i, C)$ by answering $O(\log t)$ select queries as described in [15].
Hence $\mathrm{rank}_a(i, C)$ is computed in $O(t \log t \log\log\sigma)$ time. We set $t = \log\sigma/(\log\log\sigma)^3$. Hence a
query $\mathrm{rank}_a(i, C)$ is answered in $O(\log\sigma/\log\log\sigma)$ time.

Linear Construction Time. The data structure described above can be constructed in $O(n)$
time. We can split $S$ into blocks in linear time. Then we stably sort each block and compute
the number of times $n_a$ the symbol $a \in \Sigma$ occurs in a block. We can implement stable sorting
by replacing each $C[i]$ with $C'[i] = C[i] \cdot \sigma + i$ and applying radix sort to the resulting sequence.
Using the sorted array $C'$, we can: (i) compute $\pi(i)$ for each position $i$ within a block; (ii) find values
of $n_a$ for each symbol $a$ and construct the sequence $V$; (iii) generate sets $F_a$ and the array $R$.
All these auxiliary structures can be created in linear time. We can construct a y-trie for $F_a$ in
$O(|F_a|(\log\log\sigma)^3)$ time: each element of a y-trie is kept in $O(\log\log\sigma)$ dictionary data structures;
using the deterministic method described in [38], we can construct a dictionary with $m \le \sigma$ elements
in $O(m(\log\log\sigma)^2)$ time. Hence the total time needed to construct a y-trie is $O(|F_a|(\log\log\sigma)^3)$.
Since all $F_a$ contain $O(\sigma/t)$ elements, y-tries for all $F_a$ are created in $O(\sigma(\log\log\sigma)^3/t + \sigma)$ time.
Since we can compute $\pi(i)$ for each $i$ in $O(1)$ time using $C'$, we can produce a data structure for
computing $\pi^{-1}$ in linear time. Thus the data structure for answering rank and select queries in a
block can be created in $O(\sigma)$ time. When values of $n_a$ are known for all blocks we can construct
global bit sequences $B_a$ for each $a \in \Sigma$.
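
The two ingredients used in the construction, stable sorting through the keys $C[i] \cdot \sigma + i$ and the marked-cycle structure for $\pi^{-1}$ in the spirit of [32], can be sketched together. The sketch makes several assumptions that are not part of the paper: positions are 0-based, Python's built-in sort stands in for radix sort, the class name `MarkedCycleInverse` is hypothetical, marked values are kept in a dictionary, and the query tests position $i$ itself for a mark before walking forward.

```python
def stable_sort_permutation(C, sigma):
    """pi[i] = 0-based position of C[i] after stable sorting, assuming len(C) <= sigma."""
    keys = sorted(C[i] * sigma + i for i in range(len(C)))   # stand-in for radix sort
    pi = [0] * len(C)
    for rank, key in enumerate(keys):
        pi[key % sigma] = rank           # key % sigma recovers the original position i
    return pi

class MarkedCycleInverse:
    def __init__(self, pi, t):
        self.pi = pi
        self.back = {}                   # marked position -> pi^{-t}(position)
        seen = [False] * len(pi)
        for start in range(len(pi)):
            cycle = []
            x = start
            while not seen[x]:           # collect the cycle containing start
                seen[x] = True
                cycle.append(x)
                x = pi[x]
            if len(cycle) > t:           # only long cycles get marked elements
                for idx in range(0, len(cycle), t):
                    self.back[cycle[idx]] = cycle[(idx - t) % len(cycle)]

    def inverse(self, i):
        """Return pi^{-1}(i) using O(t) evaluations of pi."""
        x = i
        while x not in self.back:        # walk forward until a marked position
            nxt = self.pi[x]
            if nxt == i:                 # short cycle closed: x is the predecessor of i
                return x
            x = nxt
        y = self.back[x]                 # jump t steps backwards
        while self.pi[y] != i:           # walk forward until just before i
            y = self.pi[y]
        return y

pi = stable_sort_permutation([2, 1, 2, 3, 1, 3, 1, 2], 8)
inv = MarkedCycleInverse(pi, 2)
```

Only the marked positions store extra information, so the additional space shrinks as $t$ grows while each inverse query walks $O(t)$ steps along a cycle.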

Data Structure for $\log^{1/2} n \le \sigma \le 2^{\log^{1/3} n}$. In this case the data structure can be constructed
in less than linear time. We assume that the symbols of $S$ are initially packed into words of $\log n$
bits so that each word contains $O(\log_\sigma n)$ symbols. We split the sequence $S$ into blocks of size
$s = \sigma\log n$. We keep exactly the same data structures for each block as in the case of $\sigma > 2^{\log^{1/3} n}$
and bit sequences $B_a$ defined in the same way as above. We start by splitting $S$ into blocks and
producing an array $C'$ for each block $C$ so that $C'[i] = C[i] \cdot s + i$. This step takes $O(s/\log^{2/3} n)$
time. $C'$ can be sorted in $O(n/\log^{1/3} n)$ time, using the ideas of sorting algorithms for small
integers described in [2] and [1]. Then we can traverse the sorted array $C'$ and generate sets that must
be stored in data structures $F_a$ in $O((s/\log^{2/3} n) + \sigma)$ time. All $F_a$ contain $O(s/t)$ elements and
can be constructed in $O((s/t)(\log\log n)^3 + \sigma) = O((s/t)(\log\log n)^3)$ time. We traverse $C'$ again
and obtain $R'[i]$, the value of $R$ for the symbol stored in the $i$-th entry of the sorted array $C'$, for each $i$.
Given $R'$, we can construct $R$ by "reverse sorting". Let
$R_1[i] = C'[i] \cdot (\log\sigma) + R'[i]$. That is, the first $\log\sigma$ most significant bits of $R_1[i]$ contain a symbol
$C[j]$ of the sequence $C$, the next $\log s$ bits contain its position $j$ in $C$, the next $\log\log\sigma$ bits contain
the value of $R[j]$. We sort $R_1$ according to the value of bits at positions $\log\sigma + 1, \ldots, \log\sigma + \log s$
(bits that correspond to the positions $j$ of symbols in the original sequence) and then discard the
first $\log\sigma + \log s$ bits. The resulting array is the array $R$.
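
The "reverse sorting" step can be illustrated by a short sketch. It is written under assumptions: the function name `build_R_by_reverse_sorting` is hypothetical, positions are 0-based, ordinary comparison sorting replaces the packed small-integer sorting of [2] and [1], and the multiplier $t$ is used in place of $\log\sigma$ (any multiplier larger than the largest value stored in $R$ works).

```python
def build_R_by_reverse_sorting(C, t):
    """Compute R[j] = rank_{C[j]}(j, C) mod t by sorting, traversing, and sorting back."""
    s = len(C)
    sorted_keys = sorted(c * s + j for j, c in enumerate(C))   # C'[j] = C[j] * s + j, sorted
    # Traverse the sorted array: equal symbols are adjacent, so the rank of
    # an element is its index inside the run of its symbol.
    r_prime = []
    prev_symbol, run = None, 0
    for key in sorted_keys:
        symbol = key // s
        run = run + 1 if symbol == prev_symbol else 1
        prev_symbol = symbol
        r_prime.append(run % t)
    # Pack (symbol, position, R value) into one integer per element.
    packed = [key * t + r for key, r in zip(sorted_keys, r_prime)]
    # Sort by the position field only, then drop the symbol and position fields.
    packed.sort(key=lambda x: (x // t) % s)
    return [x % t for x in packed]

R = build_R_by_reverse_sorting([2, 1, 2, 3, 1, 3, 1, 2], 2)
```

After the second sort the entries are back in the order of the original positions, so the low-order field of each packed value is exactly $R[j]$.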

We can also use $C'$ to construct the bit sequence $V$: we traverse $C'$ and compute $n_a$ for all $a$,
$1 \le a \le \sigma$. When all $n_a$ are known, we can produce $V$ in $O(s/\log^{1/3} n)$ time; a data structure
supporting rank and select on $V$ can also be produced in $O(s/\log^{1/3} n)$ time.

Finally we need to create the data structure for computing $\pi^{-1}$. Recall that we have to find
all cycles of length at least $t$ and select every $t$-th element in a cycle. Let $d = \log^f n$ for $f = 1/6$.
During the first stage we create $s/d$ tuples so that each tuple is of the form $(j, \pi(j), \pi^2(j), \ldots, \pi^r(j))$
for some $r \le d$ and each integer $i$, $1 \le i \le s$, occurs in at most one tuple. First we obtain values
$\pi(i)$ for all $i \in [1, s]$ and keep tuples $(i, \pi(i))$ in the array $P_1$. Using $C'$, we can obtain $P_1$ in
$O(s/\log^{1/3} n)$ time. We traverse $P_1$ and remove all tuples $(i, \pi(i))$ such that $\pi(i) = i$. Then we
obtain the sequence $P_2$ that contains tuples $(i, \pi(i), \pi^2(i))$ for all $i$ such that $(i, \pi(i))$ is still in $P_1$.
We create a new instance $P'_1$ of $P_1$ and sort all tuples by their second components. Elements of $P'_1$
are tuples $(i, \pi(i))$ sorted by $\pi(i)$. Elements of $P_1$ are tuples $(i, \pi(i))$ sorted by $i$. Both $P_1$ and $P'_1$ are
traversed simultaneously. If the $j$-th tuple in $P'_1$ is $(i_j, \pi(i_j))$ and $\pi(i_j) = v$, then the $j$-th tuple in
$P_1$ is $(v, \pi(v))$. When we read $P'_1[j]$ and $P_1[j]$, we create the new tuple $(i_j, \pi(i_j), \pi(v) = \pi^2(i_j))$
and keep it in a sequence $P_2$. When $P_2$ is constructed, we discard $P'_1$; then we traverse $P_2$ and
remove all $(i_k, \pi(i_k), \pi^2(i_k))$ such that $\pi^2(i_k) = i_k$. This procedure is iterated $d - 1$ times.
During the $k$-th iteration, we sort tuples in $P_{k-1}$ by their last components and obtain $P'_{k-1}$. Then
we merge $P'_{k-1}$ with $P_1$ and obtain $P_k$. We traverse $P_k$ and remove tuples $(i_j, \ldots, \pi^k(i_j))$ satisfying
$\pi^k(i_j) = i_j$. Each iteration takes $O(s/\log^{1/3} n)$ time. Hence $P_d$ is obtained in $O(s/\log^{1/6} n)$ time.

At the end of the first stage we obtain the sequence $P_d$. Every value $i$ that is not in a cycle of
length $v < d$ is stored in exactly one tuple of $P_d$. Hence $P_d$ consists of $s/d$ tuples. We can easily
process all tuples in $O(|P_d|)$ time and find all values $i$, $1 \le i \le s$, that must be marked. We can
find $\pi^{-t}(i)$ for all marked positions $i$ in $O(s/d)$ time. Thus the structure for computing $\pi^{-1}$ is
constructed in $O(s/\log^{1/6} n)$ time. The total time needed to produce the static data structure for
a sequence $S$ is thus $O(|S|/\log^{1/6} n)$.

Data Structure for $\sigma < \log^{1/2} n$. In the case when $\sigma$ is very small, we use a different data
structure. We implement rank and select operations on $S$ using the result of Theorem 13 in [6].
Their data structure splits $S$ into chunks of size $\log_\sigma n/2$. Each chunk is kept as in [37]. We can
traverse $S$ and obtain a compressed representation of each chunk in $O(n/\log_\sigma n)$ time. We maintain
certain bit sequences for chunks that are described in [6] and can be constructed in $O(n\sigma/\log_\sigma n)$
time. Since $\sigma < \log^{1/2} n$, $O(n\sigma/\log_\sigma n) = O(n\log\log n/\log^{1/2} n)$. This representation of $S$ also
supports fast substring extraction: since $S$ is kept in chunks, we can decode all symbols from a
chunk in $O(1)$ time and retrieve a string of length $\ell$ in $O(\ell/\log_\sigma n)$ time.

Theorem 4 There exists a data structure $D$ that stores a sequence $S[1, n]$ in $nH_k +
O(n(\log\log\sigma)^3)$ bits, where $\sigma$ is the alphabet size, and supports queries access, rank, and select
in $O(\log n/\log\log n)$ time. $D$ can be constructed in $O(n)$ time.

Suppose that $\sigma \le 2^{\log^{1/3} n}$ and $S$ is initially stored in $O(n/\log_\sigma n)$ words, so that every word contains
$\Theta(\log_\sigma n)$ consecutive symbols; then $D$ can be constructed in $O(n/\log^{1/6} n)$ time.

Finally we remark on the re-building of static sequences that is needed by the background processes
described in Section A.4. When a subsequence $S_i$ is re-built, we retrieve $S_i$ using the algorithm for
substring extraction in $O(|S_i|/\log_\sigma n)$ time. The decoded sequence $S_i$ is then kept in uncompressed
form; we keep $S_i$ in a sequence of words, so that each word contains $\log_\sigma n$ symbols. We can apply
the construction algorithms described in this section to the uncompressed sequence $S_i$. The workspace
needed to store $S_i$ in plain form is $O(|S_i| \log\sigma)$ bits.

A.8 Operation select' on S and Reporting a Substring of a Binary Sequence

Operation $\mathrm{select}'_a(i, S)$. Let $S_a$ be the subsequence of $S$ that consists of all occurrences of a
symbol $a$. We maintain a bit sequence $W$ for each sequence $S_a$. For every element of $S_a$, we keep
one or two consecutive bits in $W$. If the $j$-th occurrence of $a$ is not stored in S, then we represent
it by a 0; if the $j$-th occurrence is stored in S (i.e., it is stored in either $S'_a$ or $S_0$), then we represent
it by a two-bit sequence 10. Let $f$ denote the number of symbols in $S'_a$ among the first $j$ symbols of
$S_a$. Then $S_a[j]$ is represented by the $(j+f)$-th bit in $W$ or by the $(j+f)$-th and $(j+f+1)$-st bits
in $W$: if $S_a[j]$ is stored in $S'_a$, then $W[f+j] = 1$ and $W[f+j+1] = 0$; otherwise $W[f+j] = 0$ and
$W[f+j+1]$ represents the next symbol in $S_a$. We can answer rank and select queries on $W$ and
support updates on $W$ in $O(\log n/\log\log n)$ time. Let $v_1 = \mathrm{select}_0(i, W)$ and $v_2 = \mathrm{rank}_1(v_1, W)$.
Then $\mathrm{select}'_a(i, S) = v_2$.

Reporting a Substring in a Binary Sequence. Let $M$ be a binary sequence. We prove the
following Lemma:

Lemma 11 Let $M$ be a binary sequence of length $n$ with $O(n/r)$ 0-bits. We can store $M$ in
$O((n/r) \log r)$ bits, so that any substring $M[i..i+\ell-1]$ can be obtained in $O(\log n/\log\log n + \ell/\log n)$
time. Insertions and deletions are supported in $O(\log n/\log\log n)$ time.

Proof: We store $M$ using a variant of run-length encoding: each substring that consists of $d$ 1-bits
followed by a 0-bit, where $0 \le d \le 2\log^2 n$, is encoded as an integer $d$. For instance, a sequence
1000011110 will be encoded as 1, 0, 0, 0, 4. We divide the run-length encoded sequence into blocks,
such that each block consists of at least $\log n/(8\log\log n)$ and at most $\log n/4$ run-lengths and the
length of each block is at most $\log n/2$ bits. Run-lengths are delta-encoded so that a run of length
$d$ uses $\log d + o(\log d)$ bits. Thus each block contains $\Omega(\log n)$ bits.
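
The run-length variant and the delta code can be sketched in a few lines. The sketch rests on assumptions that go beyond the text: the function names are hypothetical, the input is a Python string that ends with a 0-bit, the bound on $d$ and the partition into blocks are ignored, and the Elias delta code is applied to $d + 1$ because that code is defined for positive integers only.

```python
def run_lengths(bits):
    """Encode a 0/1 string: every run of d ones followed by a 0 becomes the integer d."""
    if not bits.endswith("0"):
        raise ValueError("this sketch assumes the sequence ends with a 0-bit")
    return [len(run) for run in bits[:-1].split("0")]

def from_run_lengths(runs):
    """Inverse of run_lengths."""
    return "".join("1" * d + "0" for d in runs)

def elias_delta(x):
    """Elias delta code of an integer x >= 1, as a string of bits."""
    if x < 1:
        raise ValueError("Elias delta is defined for positive integers")
    n = x.bit_length() - 1               # floor(log2 x)
    length_field = n + 1
    l = length_field.bit_length() - 1    # floor(log2(n + 1))
    return "0" * l + format(length_field, "b") + format(x, "b")[1:]

def encode(bits):
    """Delta-code every run-length d as elias_delta(d + 1)."""
    return "".join(elias_delta(d + 1) for d in run_lengths(bits))

runs = run_lengths("1000011110")
code = encode("1000011110")
```

A run-length $d$ costs about $\log d + 2\log\log d$ bits in this code, which matches the $\log d + o(\log d)$ bound used in the proof.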

We also maintain an additional data structure $A$ that finds for each position $j$ in $M$, the
run-length $d$ that encodes $M[j]$ and the block that contains the run-length $d$. $A$ encodes every
run-length in unary. Thus a run of length $d$ is represented by $1^d 0$. Since $M$ contains $O(n/r)$ 0-bits, $M$
consists of $O(n/r)$ runs of 1's followed by a 0. Hence $A$ consists of $O(n/r)$ 0's and $O(n)$ 1-bits. The
sequence $A'$ encodes in unary the number of runs in every block of $M$. Using standard methods,
we can keep $A$ and $A'$ in $O((n/r) \log r)$ bits and support queries and updates in $O(\log n/\log\log n)$
time. Using rank and select queries on $A$ and $A'$, we can find the block that encodes $M[j]$ and
the position of $M[j]$ in its block for any $j$, $1 \le j \le n$, in $O(\log n/\log\log n)$ time. We also keep a
look-up table Tbl that enables us to retrieve all $k$ elements stored in a block in $O(k/\log n)$ time;
for every block, Tbl contains the sequence of bits encoded by this block. Since there are $O(n^{1/2})$
different blocks and each block encodes a poly-logarithmic number of elements, Tbl uses $o(n)$ bits.

Each block contains either at least $\log^2 n/(4\log\log n)$ 1-bits or at least $\log n/(4\log\log n)$ 0-bits.
Hence the total number of blocks is $O(n(\log\log n/\log^2 n) + (n/r)(\log\log n/\log n))$. Each block
needs $O(\log n)$ bits. Hence all blocks use $O((n/r) \log\log n)$ bits.

To extract a substring $M[i..i+\ell-1]$, we start by finding the block $Bl$ that contains $M[i]$ and
the position of $M[i]$ in $Bl$. Then we simply decode the remaining part of the block $Bl$ and the
following blocks until $\Theta(\ell)$ symbols are decoded. □

A.9 Substring Extraction 

Now we show how the fully-dynamic data structure described in Section 5 supports the operation
of retrieving a substring of length $\ell$. Suppose that we want to extract the substring $S[i..i+\ell]$. We
keep a copy $S^w$ of the subsequence $S_0$ implemented as follows. $S^w$ is split into words, such that each
word contains between $\log_\sigma n/4$ and $\log_\sigma n/2$ symbols of $S_0$. Let $w_i$ be the number of symbols in
the $i$-th word; we maintain a prefix-sum data structure on $w_i$. Using this data structure, we can
find the word $S^w[j]$ that contains the $i$-th symbol of $S_0$ in $O(\log n/\log\log n)$ time. We can find the
position $o_i$ of $S_0[i]$ in that word in $O(1)$ time. Using table look-up, we can extract the remaining
symbols of $S^w[j]$ in $O(1)$ time. If $w_j - o_i < \ell$, we extract the following symbols from words $S^w[j+1]$,
$S^w[j+2]$, $\ldots$ until $\ell$ symbols are reported.

The static data structure on $S_i$ can be used to extract $\ell$ symbols in $O(\log n/\log\log n + \ell/\log_\sigma n)$
time. Some of these symbols can, however, be marked as deleted. We use the following additional
structures in order to extract $\Theta(\log_\sigma n)$ undeleted symbols in $O(1)$ time. Recall that each sequence
$S_i$ is stored as a sequence of meta-symbols $S_i^M$ and every meta-symbol represents $\Theta(\log_\sigma n)$ symbols.
We say that a meta-symbol $S_i^M[j]$ is spoiled if at least $\log_\sigma n/4$ symbols represented by $S_i^M[j]$ are
marked as deleted. A symbol is spoiled if it is stored in a spoiled meta-symbol. Positions of spoiled
symbols are indicated by a binary sequence $SPOS_i$. That is, $SPOS_i[j] = 1$ iff the symbol $S_i[j]$ is
not spoiled. Symbols stored in spoiled meta-symbols are also kept in a sequence $V_i$. The representation
of $V_i$ is similar to the representation of $S^w$, but it contains only undeleted symbols stored in spoiled
meta-symbols. $V_i$ is divided into words and each word $V_i^M[j]$ contains up to $\log_\sigma n/2$ symbols. If a
word $V_i^M[j]$ contains fewer than $\log_\sigma n/4$ symbols, then the last symbol in this word is followed by a
non-spoiled symbol. Each word is augmented with a field next. Let $fol(j)$ denote the symbol that
follows the last symbol in $V_i^M[j]$. $V_i^M[j].next = NULL$ if $fol(j)$ is spoiled; otherwise $V_i^M[j].next$
points to the position of $fol(j)$ in $S_i^M$. A sequence $VPOS_i$ indicates boundaries of words in $V_i^M$.
$VPOS_i$ contains a 0-bit for every symbol in $V_i$ that is not the last symbol in its word; $VPOS_i$
contains a two-bit substring 01 for every symbol that is the last symbol in its word. Thus each
symbol is encoded by a 0-bit and the end of every word in $V_i$ is encoded by a 1-bit. If a symbol
$S_i[j]$ is not marked as deleted and kept in a spoiled meta-symbol, then we can find the position of
$S_i[j]$ in $V_i$ by answering one rank query on $SPOS_i$ and one rank and one select query on $VPOS_i$.

The total number of symbols that are marked as deleted in all $S_i$ is bounded by $O(n/r)$. Hence
the number of spoiled symbols in all $S_i$ is also $O(n/r)$. Non-deleted symbols kept in a spoiled
meta-symbol are stored in at most three words of $V_i^M$. Hence the total number of words in all $V_i^M$ is
bounded by $O(n/(r \log_\sigma n))$. Since every word uses $O(\log n)$ bits of space, all $V_i$ need $O((n/r) \log\sigma)$
bits. All bit sequences $VPOS_i$ and $SPOS_i$ use $O(n/r)$ and $O((n/r) \log r)$ bits respectively. Hence
we need $O((n/r) \log\sigma)$ additional bits in order to support substring extraction.

Suppose that a string $S_t[i..i+\ell]$ must be extracted. We find the meta-symbol $S_t^M[j_0]$ that
contains $S_t[i]$ and decode meta-symbols $S_t^M[j_0]$, $S_t^M[j_0+1]$, $\ldots$ and output the appropriate symbols
until $\ell$ symbols are reported or a spoiled meta-symbol is encountered. If the symbol $S_t^M[j]$ is spoiled,
we find the position of $S_t^M[j]$ in $V_t$ and output symbols from $V_t$. If we enumerated symbols of $V_t^M[j_1]$
and $V_t^M[j_1].next \ne NULL$, then we switch back to $S_t^M[j_2]$, where $S_t^M[j_2]$ is the meta-symbol that is
pointed to by $V_t^M[j_1].next$, and decode symbols from $S_t^M[j_2]$, $S_t^M[j_2+1]$, $\ldots$ until a spoiled symbol
is encountered. We output symbols from $V_t^M[j_1+1]$, $\ldots$, $V_t^M[j_2]$ until $V_t^M[j_2].next \ne NULL$. We
proceed in the same way until $\ell$ symbols are decoded. Each meta-symbol of $S_t$ and each word of $V_t$
is processed in $O(1)$ time. It is easy to check that the total number of words and meta-symbols is
bounded by $O(\ell/\log_\sigma n)$. Every retrieved non-spoiled meta-symbol in $S_t^M$, except for the first one and
the last one, contains $\Theta(\log_\sigma n)$ symbols. Every processed word in $V_t^M$, except for the last one,
either contains $\Theta(\log_\sigma n)$ symbols or is followed by a non-spoiled meta-symbol. The position of the
first accessed spoiled symbol in $V_t$ is computed in $O(\log n/\log\log n)$ time. The position of the first
accessed meta-symbol in $S_t^M$ is also computed in $O(\log n/\log\log n)$ time. Thus the total query
time is $O(\log n/\log\log n + \ell/\log_\sigma n)$.

The extraction of $\ell$ symbols $S[i..i+\ell]$ from the global sequence $S$ is implemented as follows.
We find $i_0 = \mathrm{rank}_0(i, R)$ and $i_1 = \mathrm{rank}_1(i, R)$. We compute $t$ such that
$\sum_{j=1}^{t-1} Size[j] < i_1 \le \sum_{j=1}^{t} Size[j]$ and extract the substring $S_t[f..f+\ell]$ for
$f = i_1 - \sum_{j=1}^{t-1} Size[j]$. If the end of $S_t$ is
reached, we extract the remaining symbols from $S_{t+1}$, $S_{t+2}$, $\ldots$. We also extract $S_0[i_0..i_0+\ell]$. Let $Str_1$
be the substring extracted from the static subsequence (or subsequences) and let $Str_0$ be the string
extracted from $S_0$. We can merge the prefix of $Str_1$ with the prefix of $Str_0$ using $R$. At each step
we consider the next $\log_\sigma n/6$ symbols of $Str_0$ and $\log_\sigma n/6$ symbols of $Str_1$ that are not processed
yet. Suppose that these symbols are stored in words $W_0$ and $W_1$ respectively. We read the next
$\log_\sigma n/6$ bits of $R$ and keep them in a bit sequence $R_W$. Using a look-up table, we can obtain the
sequence $W_{res}$ that consists of the following $\log_\sigma n/6$ symbols in $O(1)$ time: if $R_W[j] = 0$, then the $j$-th
symbol of $W_{res}$ is the $r_0$-th symbol of $W_0$, where $r_0$ is the number of 0's among the first $j$ bits of
$R_W$; otherwise the $j$-th symbol of $W_{res}$ is the $r_1$-th symbol of $W_1$, where $r_1$ is the number of 1's
among the first $j$ bits of $R_W$. The sequence $W_{res}$ contains the next $\log_\sigma n/6$ symbols of $S[i..i+\ell]$.
Proceeding in the same way, we can obtain the substring $S[i..i+\ell]$ in $O(6\ell/\log_\sigma n) = O(\ell/\log_\sigma n)$
time.
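
The merging rule can be illustrated by a symbol-at-a-time sketch. It is a simplification under assumptions: the function name `merge_by_bits` is hypothetical, the sequences are ordinary Python sequences, and the word-level look-up table that handles a whole group of symbols in constant time is replaced by a plain loop.

```python
def merge_by_bits(r_bits, str0, str1):
    """Interleave str0 and str1: a 0-bit takes the next symbol of str0, a 1-bit the next of str1."""
    out = []
    i0 = i1 = 0
    for b in r_bits:
        if b == 0:
            out.append(str0[i0])
            i0 += 1
        else:
            out.append(str1[i1])
            i1 += 1
    return out

merged = merge_by_bits([0, 1, 1, 0, 1], "ad", "bce")
```

In the word-level version, the table would be indexed by the bits of $R_W$ together with the contents of the two words, so that the whole loop body is replaced by a single table access per group of symbols.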